from:"Nick Piggin"

Re: oops when using git gc --auto

2008-02-26 Thread Nick Piggin

On Wednesday 27 February 2008 00:22, Otavio Salvador wrote:
> Hello,
>
> Today I got this oops, someone has an idea of what's going wrong?
>
> Unable to handle kernel paging request at 0200 RIP:
>  [] find_get_pages+0x3c/0x69

At this point, the most likely candidate is a memory corruption
error, probably hardware. Can you run memtest86 for a few hours
to get a bit more confidence in the hw (preferably overnight)?

I did recently see another quite similar corruption in the
pagecache radix-tree, though. Coincidence maybe?

> PGD 0
> Oops:  [1] SMP
> CPU 3
> Modules linked in: sha256_generic aes_generic aes_x86_64 cbc blkcipher
> nvidia(P) rfcomm l2cap bluetooth ac battery ipv6 nfs lockd nfs_acl sunrpc
> bridge ext2 mbcache dm_crypt tun kvm_intel kvm loop snd_usb_audio
> snd_usb_lib snd_rawmidi snd_hda_intel e1000e i2c_i801 serio_raw
> snd_seq_device snd_pcm intel_agp button snd_timer pcspkr psmouse snd_hwdep
> snd snd_page_alloc soundcore evdev i2c_core xfs dm_mirror dm_snapshot
> dm_mod raid0 md_mod sg sr_mod cdrom sd_mod usbhid hid usb_storage
> pata_marvell floppy ahci ata_generic libata scsi_mod ehci_hcd uhci_hcd
> thermal processor fan Pid: 15684, comm: git Tainted: P   
> 2.6.24-1-amd64 #1
> RIP: 0010:[]  []
> find_get_pages+0x3c/0x69 RSP: 0018:8100394dfd98  EFLAGS: 00010097
> RAX: 0009 RBX: 000e RCX: 0009
> RDX: 0200 RSI: 000a RDI: 0040
> RBP: 810042964350 R08: 0040 R09: 000a
> R10: 8100425a06c8 R11: 000a R12: 000e
> R13: 8100394dfdf8 R14: 810042964350 R15: 
> FS:  2ae326df2190() GS:81007d7aeb40()
> knlGS: CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 0200 CR3: 358f9000 CR4: 26e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process git (pid: 15684, threadinfo 8100394de000, task
> 8100359cd800) Stack:  000d 8100394dfde8
> 000d 000e 000e 802794d6
> 8100014a7768 80279b04  
>   Call Trace:
>  [] pagevec_lookup+0x17/0x1e
>  [] truncate_inode_pages_range+0x108/0x2bd
>  [] generic_delete_inode+0xbf/0x127
>  [] do_unlinkat+0xd5/0x144
>  [] sys_write+0x45/0x6e
>  [] system_call+0x7e/0x83
>
>
> Code: 48 8b 02 25 00 40 02 00 48 3d 00 40 02 00 75 04 48 8b 52 10
> RIP  [] find_get_pages+0x3c/0x69
>  RSP 
> CR2: 0200
> ---[ end trace cb43a9f4488b815a ]---

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-26 Thread Nick Piggin

On Tuesday 26 February 2008 18:21, Gleb Natapov wrote:
> On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
> > > You are missing one point here.  The MPI specifications that have
> > > been out there for decades do not require the process use a library
> > > for allocating the buffer.  I realize that is a horrible shortcoming,
> > > but that is the world we live in.  Even if we could change that spec,
> >
> > Can you change the spec?
>
> Not really. It will break all existing codes.

I meant as in eg. submit changes to MPI-3


> MPI-2 provides a call for 
> memory allocation (and it's beneficial to use this call for some
> interconnects), but many (most?) applications are still written for MPI-1
> and those that are written for MPI-2 mostly uses the old habit of
> allocating memory by malloc(), or even use stack or BSS memory for
> communication buffer purposes.

OK, so MPI-2 already has some way to do that... I'm not saying that we
can now completely dismiss the idea of using notifiers for this, but it
is just a good data point to know.

Thanks,
Nick

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Nick Piggin

On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> 
wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)
> >
> > Two passes:
> >
> > Pass 1: shove each of the segments into the queue with
> > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> >
> > Pass 2: wait for them all to complete and return accumulated result
> > with SYNC_FILE_RANGE_WAIT_AFTER
>
> Thanks.
>
> Seems ok, though being able to cork the I/O until the last one would
> be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)
>
> I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
> reason why you have it there?  The man page isn't very enlightening.

Yeah, sync_file_range has slightly unusual semantics and introduce
the new concept, "writeout", to userspace (does "writeout" include
"in drive cache"? the kernel doesn't think so, but the only way to
make sync_file_range "safe" is if you do consider it writeout).

If it makes it any easier to understand, we can add in
SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with
safe/unsafe and sync/async semantics that is part of the normal
POSIX api.

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and is a bad bug that we don't do that :(

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Nick Piggin

On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
 Andrew Morton wrote:
  On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] 
wrote:
   (It would be nicer if sync_file_range()
   took a vector of ranges for better elevator scheduling, but let's
   ignore that :-)
 
  Two passes:
 
  Pass 1: shove each of the segments into the queue with
  SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
 
  Pass 2: wait for them all to complete and return accumulated result
  with SYNC_FILE_RANGE_WAIT_AFTER

 Thanks.

 Seems ok, though being able to cork the I/O until the last one would
 be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)

 I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
 reason why you have it there?  The man page isn't very enlightening.


Yeah, sync_file_range has slightly unusual semantics and introduce
the new concept, writeout, to userspace (does writeout include
in drive cache? the kernel doesn't think so, but the only way to
make sync_file_range safe is if you do consider it writeout).

If it makes it any easier to understand, we can add in
SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with
safe/unsafe and sync/async semantics that is part of the normal
POSIX api.

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and is a bad bug that we don't do that :(

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-26 Thread Nick Piggin

On Tuesday 26 February 2008 18:21, Gleb Natapov wrote:
 On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
   You are missing one point here.  The MPI specifications that have
   been out there for decades do not require the process use a library
   for allocating the buffer.  I realize that is a horrible shortcoming,
   but that is the world we live in.  Even if we could change that spec,
 
  Can you change the spec?

 Not really. It will break all existing codes.

I meant as in eg. submit changes to MPI-3


 MPI-2 provides a call for 
 memory allocation (and it's beneficial to use this call for some
 interconnects), but many (most?) applications are still written for MPI-1
 and those that are written for MPI-2 mostly uses the old habit of
 allocating memory by malloc(), or even use stack or BSS memory for
 communication buffer purposes.

OK, so MPI-2 already has some way to do that... I'm not saying that we
can now completely dismiss the idea of using notifiers for this, but it
is just a good data point to know.

Thanks,
Nick

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: oops when using git gc --auto

2008-02-26 Thread Nick Piggin

On Wednesday 27 February 2008 00:22, Otavio Salvador wrote:
 Hello,

 Today I got this oops, someone has an idea of what's going wrong?

 Unable to handle kernel paging request at 0200 RIP:
  [802735c3] find_get_pages+0x3c/0x69

At this point, the most likely candidate is a memory corruption
error, probably hardware. Can you run memtest86 for a few hours
to get a bit more confidence in the hw (preferably overnight)?

I did recently see another quite similar corruption in the
pagecache radix-tree, though. Coincidence maybe?

 PGD 0
 Oops:  [1] SMP
 CPU 3
 Modules linked in: sha256_generic aes_generic aes_x86_64 cbc blkcipher
 nvidia(P) rfcomm l2cap bluetooth ac battery ipv6 nfs lockd nfs_acl sunrpc
 bridge ext2 mbcache dm_crypt tun kvm_intel kvm loop snd_usb_audio
 snd_usb_lib snd_rawmidi snd_hda_intel e1000e i2c_i801 serio_raw
 snd_seq_device snd_pcm intel_agp button snd_timer pcspkr psmouse snd_hwdep
 snd snd_page_alloc soundcore evdev i2c_core xfs dm_mirror dm_snapshot
 dm_mod raid0 md_mod sg sr_mod cdrom sd_mod usbhid hid usb_storage
 pata_marvell floppy ahci ata_generic libata scsi_mod ehci_hcd uhci_hcd
 thermal processor fan Pid: 15684, comm: git Tainted: P   
 2.6.24-1-amd64 #1
 RIP: 0010:[802735c3]  [802735c3]
 find_get_pages+0x3c/0x69 RSP: 0018:8100394dfd98  EFLAGS: 00010097
 RAX: 0009 RBX: 000e RCX: 0009
 RDX: 0200 RSI: 000a RDI: 0040
 RBP: 810042964350 R08: 0040 R09: 000a
 R10: 8100425a06c8 R11: 000a R12: 000e
 R13: 8100394dfdf8 R14: 810042964350 R15: 
 FS:  2ae326df2190() GS:81007d7aeb40()
 knlGS: CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 0200 CR3: 358f9000 CR4: 26e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process git (pid: 15684, threadinfo 8100394de000, task
 8100359cd800) Stack:  000d 8100394dfde8
 000d 000e 000e 802794d6
 8100014a7768 80279b04  
   Call Trace:
  [802794d6] pagevec_lookup+0x17/0x1e
  [80279b04] truncate_inode_pages_range+0x108/0x2bd
  [802a9e3a] generic_delete_inode+0xbf/0x127
  [802a1a4d] do_unlinkat+0xd5/0x144
  [802989e2] sys_write+0x45/0x6e
  [8020be2e] system_call+0x7e/0x83


 Code: 48 8b 02 25 00 40 02 00 48 3d 00 40 02 00 75 04 48 8b 52 10
 RIP  [802735c3] find_get_pages+0x3c/0x69
  RSP 8100394dfd98
 CR2: 0200
 ---[ end trace cb43a9f4488b815a ]---

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-25 Thread Nick Piggin

On Thursday 21 February 2008 21:58, Robin Holt wrote:
> On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote:
> > > > So why can't you export a device from your xpmem driver, which
> > > > can be mmap()ed to give out "anonymous" memory pages to be used
> > > > for these communication buffers?
> > >
> > > Because we need to have heap and stack available as well.  MPT does
> > > not control all the communication buffer areas.  I haven't checked, but
> > > this is the same problem that IB will have.  I believe they are
> > > actually allowing any memory region be accessible, but I am not sure of
> > > that.
> >
> > Then you should create a driver that the user program can register
> > and unregister regions of their memory with. The driver can do a
> > get_user_pages to get the pages, and then you'd just need to set up
> > some kind of mapping so that userspace can unmap pages / won't leak
> > memory (and an exit_mm notifier I guess).
>
> OK.  You need to explain this better to me.  How would this driver
> supposedly work?  What we have is an MPI library.  It gets invoked at
> process load time to establish its rank-to-rank communication regions.
> It then turns control over to the processes main().  That is allowed to
> run until it hits the
>   MPI_Init(, );
>
> The process is then totally under the users control until:
>   MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, 
> MPI_COMM_WORLD);
>   MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD,
> );
>
> That is it.  That is all our allowed interaction with the users process.

OK, when you said something along the lines of "the MPT library has
control of the comm buffer", then I assumed it was an area of virtual
memory which is set up as part of initialization, rather than during
runtime. I guess I jumped to conclusions.


> That doesn't seem too unreasonable, except when you compare it to how the
> driver currently works.  Remember, this is done from a library which has
> no insight into what the user has done to its own virtual address space.
> As a result, each MPI_Send() would result in a system call (or we would
> need to have a set of callouts for changes to a processes VMAs) which
> would be a significant increase in communication overhead.
>
> Maybe I am missing what you intend to do, but what we need is a means of
> tracking one processes virtual address space changes so other processes
> can do direct memory accesses without the need for a system call on each
> communication event.

Yeah it's tricky. BTW. what is the performance difference between
having a system call or no?


> > Because you don't need to swap, you don't need coherency, and you
> > are in control of the areas, then this seems like the best choice.
> > It would allow you to use heap, stack, file-backed, anything.
>
> You are missing one point here.  The MPI specifications that have
> been out there for decades do not require the process use a library
> for allocating the buffer.  I realize that is a horrible shortcoming,
> but that is the world we live in.  Even if we could change that spec,

Can you change the spec? Are you working on it?


> we would still need to support the existing specs.  As a result, the
> user can change their virtual address space as they need and still expect
> communications be cheap.

That's true. How has it been supported up to now? Are you using
these kind of notifiers in patched kernels?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-25 Thread Nick Piggin

On Wednesday 20 February 2008 09:01, Alexey Dobriyan wrote:
> On Tue, Feb 19, 2008 at 11:47:11PM +0300,  wrote:

> > > Are you reproducing it simply by running the
> > > ftest03 binary directly from the shell? How many times between oopses?
> > > It is multi-process but no threads, so races should be minimal down
> > > this path -- can you get an strace of the failing process?
>
> Speaking of multi-proceseness, changing MAXCHILD to 1, nchild to 1,
> AFAICS, generates one child which oopses the very same way (in parallel
> with generic LTP) But, lowering MAXIOVCNT to 8 generates no oops.

Thanks, I was able to reproduce quite easily with these settings.
I think I have the correct patch now (at least it isn't triggerable
any more here).

Thanks,
Nick
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c74b68..2650073 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1750,14 +1750,18 @@ static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
 	} else {
 		const struct iovec *iov = i->iov;
 		size_t base = i->iov_offset;
+		size_t copied = 0;
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments (without overruning the iovec).
 		 */
-		while (bytes || !iov->iov_len) {
-			int copy = min(bytes, iov->iov_len - base);
+		while (copied < bytes ||
+unlikely(!iov->iov_len && copied < i->count)) {
+			int copy;
 
+			copy = min(bytes, iov->iov_len - base);
+			copied += copy;
 			bytes -= copy;
 			base += copy;
 			if (iov->iov_len == base) {

Re: 2.6.24-sha1: RIP [ffffffff802596c8] iov_iter_advance+0x38/0x70

2008-02-25 Thread Nick Piggin

On Wednesday 20 February 2008 09:01, Alexey Dobriyan wrote:
 On Tue, Feb 19, 2008 at 11:47:11PM +0300,  wrote:

   Are you reproducing it simply by running the
   ftest03 binary directly from the shell? How many times between oopses?
   It is multi-process but no threads, so races should be minimal down
   this path -- can you get an strace of the failing process?

 Speaking of multi-proceseness, changing MAXCHILD to 1, nchild to 1,
 AFAICS, generates one child which oopses the very same way (in parallel
 with generic LTP) But, lowering MAXIOVCNT to 8 generates no oops.

Thanks, I was able to reproduce quite easily with these settings.
I think I have the correct patch now (at least it isn't triggerable
any more here).

Thanks,
Nick
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c74b68..2650073 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1750,14 +1750,18 @@ static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
 	} else {
 		const struct iovec *iov = i-iov;
 		size_t base = i-iov_offset;
+		size_t copied = 0;
 
 		/*
 		 * The !iov-iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments (without overruning the iovec).
 		 */
-		while (bytes || !iov-iov_len) {
-			int copy = min(bytes, iov-iov_len - base);
+		while (copied  bytes ||
+unlikely(!iov-iov_len  copied  i-count)) {
+			int copy;
 
+			copy = min(bytes, iov-iov_len - base);
+			copied += copy;
 			bytes -= copy;
 			base += copy;
 			if (iov-iov_len == base) {

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-25 Thread Nick Piggin

On Thursday 21 February 2008 21:58, Robin Holt wrote:
 On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote:
So why can't you export a device from your xpmem driver, which
can be mmap()ed to give out anonymous memory pages to be used
for these communication buffers?
  
   Because we need to have heap and stack available as well.  MPT does
   not control all the communication buffer areas.  I haven't checked, but
   this is the same problem that IB will have.  I believe they are
   actually allowing any memory region be accessible, but I am not sure of
   that.
 
  Then you should create a driver that the user program can register
  and unregister regions of their memory with. The driver can do a
  get_user_pages to get the pages, and then you'd just need to set up
  some kind of mapping so that userspace can unmap pages / won't leak
  memory (and an exit_mm notifier I guess).

 OK.  You need to explain this better to me.  How would this driver
 supposedly work?  What we have is an MPI library.  It gets invoked at
 process load time to establish its rank-to-rank communication regions.
 It then turns control over to the processes main().  That is allowed to
 run until it hits the
   MPI_Init(argc, argv);

 The process is then totally under the users control until:
   MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, 
 MPI_COMM_WORLD);
   MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD,
 status);

 That is it.  That is all our allowed interaction with the users process.

OK, when you said something along the lines of the MPT library has
control of the comm buffer, then I assumed it was an area of virtual
memory which is set up as part of initialization, rather than during
runtime. I guess I jumped to conclusions.


 That doesn't seem too unreasonable, except when you compare it to how the
 driver currently works.  Remember, this is done from a library which has
 no insight into what the user has done to its own virtual address space.
 As a result, each MPI_Send() would result in a system call (or we would
 need to have a set of callouts for changes to a processes VMAs) which
 would be a significant increase in communication overhead.

 Maybe I am missing what you intend to do, but what we need is a means of
 tracking one processes virtual address space changes so other processes
 can do direct memory accesses without the need for a system call on each
 communication event.

Yeah it's tricky. BTW. what is the performance difference between
having a system call or no?


  Because you don't need to swap, you don't need coherency, and you
  are in control of the areas, then this seems like the best choice.
  It would allow you to use heap, stack, file-backed, anything.

 You are missing one point here.  The MPI specifications that have
 been out there for decades do not require the process use a library
 for allocating the buffer.  I realize that is a horrible shortcoming,
 but that is the world we live in.  Even if we could change that spec,

Can you change the spec? Are you working on it?


 we would still need to support the existing specs.  As a result, the
 user can change their virtual address space as they need and still expect
 communications be cheap.

That's true. How has it been supported up to now? Are you using
these kind of notifiers in patched kernels?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] alloc_percpu() fails to allocate percpu data

2008-02-23 Thread Nick Piggin

On Friday 22 February 2008 09:26, Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
> > Some oprofile results obtained while using tbench on a 2x2 cpu machine
> > were very surprising.
> >
> > For example, loopback_xmit() function was using high number of cpu
> > cycles to perform the statistic updates, supposed to be real cheap
> > since they use percpu data
> >
> > pcpu_lstats = netdev_priv(dev);
> > lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> > lb_stats->packets++;  /* HERE : serious contention */
> > lb_stats->bytes += skb->len;
> >
> >
> > struct pcpu_lstats is a small structure containing two longs. It
> > appears that on my 32bits platform, alloc_percpu(8) allocates a single
> > cache line,  instead of giving to each cpu a separate cache line.
> >
> > Using the following patch gave me impressive boost in various
> > benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too)
> >
> > Long term fix (ie >= 2.6.26) would be to let each CPU allocate their
> > own block of memory, so that we dont need to roudup sizes to
> > L1_CACHE_BYTES, or merging the SGI stuff of course...
> >
> > Note : SLUB vs SLAB is important here to *show* the improvement, since
> > they dont have the same minimum allocation sizes (8 bytes vs 32
> > bytes). This could very well explain regressions some guys reported
> > when they switched to SLUB.
>
> I've complained about this false sharing as well, so until we get the
> new and improved percpu allocators,

What I don't understand is why the slab allocators have something like
this in it:

if ((flags & SLAB_HWCACHE_ALIGN) &&
size > cache_line_size() / 2)
return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't
understand, why do they think they knows better than the caller?
Things like this are just going to lead to very difficult to track
performance problems. Possibly correctness problems in rare cases.

There could be another flag for "maybe align".

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] alloc_percpu() fails to allocate percpu data

2008-02-23 Thread Nick Piggin

On Friday 22 February 2008 09:26, Peter Zijlstra wrote:
 On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
  Some oprofile results obtained while using tbench on a 2x2 cpu machine
  were very surprising.
 
  For example, loopback_xmit() function was using high number of cpu
  cycles to perform the statistic updates, supposed to be real cheap
  since they use percpu data
 
  pcpu_lstats = netdev_priv(dev);
  lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
  lb_stats-packets++;  /* HERE : serious contention */
  lb_stats-bytes += skb-len;
 
 
  struct pcpu_lstats is a small structure containing two longs. It
  appears that on my 32bits platform, alloc_percpu(8) allocates a single
  cache line,  instead of giving to each cpu a separate cache line.
 
  Using the following patch gave me impressive boost in various
  benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too)
 
  Long term fix (ie = 2.6.26) would be to let each CPU allocate their
  own block of memory, so that we dont need to roudup sizes to
  L1_CACHE_BYTES, or merging the SGI stuff of course...
 
  Note : SLUB vs SLAB is important here to *show* the improvement, since
  they dont have the same minimum allocation sizes (8 bytes vs 32
  bytes). This could very well explain regressions some guys reported
  when they switched to SLUB.

 I've complained about this false sharing as well, so until we get the
 new and improved percpu allocators,

What I don't understand is why the slab allocators have something like
this in it:

if ((flags  SLAB_HWCACHE_ALIGN) 
size  cache_line_size() / 2)
return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't
understand, why do they think they knows better than the caller?
Things like this are just going to lead to very difficult to track
performance problems. Possibly correctness problems in rare cases.

There could be another flag for maybe align.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Document huge memory/cache overhead of memory controller in Kconfig

2008-02-21 Thread Nick Piggin

On Wednesday 20 February 2008 23:52, Balbir Singh wrote:
> Andi Kleen wrote:
> > Document huge memory/cache overhead of memory controller in Kconfig
> >
> > I was a little surprised that 2.6.25-rc* increased struct page for the
> > memory controller.  At least on many x86-64 machines it will not fit into
> > a single cache line now anymore and also costs considerable amounts of
> > RAM.
>
> The size of struct page earlier was 56 bytes on x86_64 and with 64 bytes it
> won't fit into the cacheline anymore? Please also look at
> http://lwn.net/Articles/234974/

BTW. We'll probably want to increase the width of some counters
in struct page at some point for 64-bit, so then it really will
go over with the memory controller!

Actually, an external data structure is a pretty good idea. We
could probably do it easily with a radix tree (pfn->memory
controller). And that might be a better option for distros.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Document huge memory/cache overhead of memory controller in Kconfig

2008-02-21 Thread Nick Piggin

On Wednesday 20 February 2008 23:52, Balbir Singh wrote:
 Andi Kleen wrote:
  Document huge memory/cache overhead of memory controller in Kconfig
 
  I was a little surprised that 2.6.25-rc* increased struct page for the
  memory controller.  At least on many x86-64 machines it will not fit into
  a single cache line now anymore and also costs considerable amounts of
  RAM.

 The size of struct page earlier was 56 bytes on x86_64 and with 64 bytes it
 won't fit into the cacheline anymore? Please also look at
 http://lwn.net/Articles/234974/

BTW. We'll probably want to increase the width of some counters
in struct page at some point for 64-bit, so then it really will
go over with the memory controller!

Actually, an external data structure is a pretty good idea. We
could probably do it easily with a radix tree (pfn-memory
controller). And that might be a better option for distros.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin

On Wed, Feb 20, 2008 at 01:03:24PM +0100, Andrea Arcangeli wrote:
> If there's agreement that the VM should alter its locking from
> spinlock to mutex for its own good, then Christoph's
> one-config-option-fits-all becomes a lot more appealing (replacing RCU
> with a mutex in the mmu notifier list registration locking isn't my
> main worry and the non-sleeping-users may be ok to live with it).

Just from a high level view, in some cases we can just say that no we
aren't going to support this. And this may well be one of those cases.

The more constraints placed on the VM, the harder it becomes to
improve and adapt in future. And this seems like a pretty big restriction.
(especially if we can eg. work around it completely by having a special
purpose driver to get_user_pages on comm buffers as I suggested in the
other mail).

At any rate, I believe Andrea's patch really places minimal or no further
constraints than a regular CPU TLB (or the hash tables that some archs
implement). So we're kind of in 2 different leagues here.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin

On Wed, Feb 20, 2008 at 11:39:42AM +0100, Andrea Arcangeli wrote:
> Given Nick's comments I ported my version of the mmu notifiers to
> latest mainline. There are no known bugs AFIK and it's obviously safe
> (nothing is allowed to schedule inside rcu_read_lock taken by
> mmu_notifier() with my patch).

Thanks! Yes the seqlock you are using now ends up looking similar
to what I did and I couldn't find a hole in that either. So I
think this is going to work.

I do prefer some parts of my patch, however for everyone's sanity,
I think you should be the maintainer of the mmu notifiers, and I
will send you incremental changes that can be discussed more easily
that way (nothing major, mainly style and minor things).

> XPMEM simply can't use RCU for the registration locking if it wants to
> schedule inside the mmu notifier calls. So I guess it's better to add
> the XPMEM invalidate_range_end/begin/external-rmap as a whole
> different subsystem that will have to use a mutex (not RCU) to
> serialize, and at the same time that CONFIG_XPMEM will also have to
> switch the i_mmap_lock to a mutex. I doubt xpmem fits inside a
> CONFIG_MMU_NOTIFIER anymore, or we'll all run a bit slower because of
> it. It's really a call of how much we want to optimize the MMU
> notifier, by keeping things like RCU for the registration.

I agree: your coherent, non-sleeping mmu notifiers are pretty simple
and unintrusive. The sleeping version is fundamentally going to either
need to change VM locks, or be non-coherent, so I don't think there is
a question of making one solution fit everybody. So the sleeping /
xrmap patch should be kept either completely independent, or as an
add-on to this one.

I will post some suggestions to you when I get a chance.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin

On Wed, Feb 20, 2008 at 02:09:41AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
> > Sorry, I realise I still didn't get this through my head yet (and also
> > have not seen your patch recently). So I don't know exactly what you
> > are doing...
> 
> The last version was posted here:
> 
> http://marc.info/?l=kvm-devel=120321732521533=2
> 
> > But why does _anybody_ (why does Christoph's patches) need to invalidate
> > when they are going to be more permissive? This should be done lazily by
> > the driver, I would have thought.
> 
> This can be done lazily by the driver yes. The place where I've an
> invalidate_pages in mprotect however can also become less permissive.

That's OK, because we have to flush tlbs there too.


> It's simpler to invalidate always and it's not guaranteed the
> secondary mmu page fault is capable of refreshing the spte across a
> writeprotect fault.

I think we just have to make sure that it _can_ do writeprotect
faults. AFAIKS, that will be possible if the driver registers a
.page_mkwrite handler (actually not quite -- page_mkwrite is fairly
crap, so I have a patch to merge it together with .fault so we get
address information as well). Anyway, I really think we should do
it that way.

> In the future this can be changed to
> mprotect_pages though, so no page fault will happen in the secondary
> mmu.

Possibly, but hopefully not needed for performance. Let's wait and
see.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin

On Tue, Feb 19, 2008 at 05:40:50PM -0600, Jack Steiner wrote:
> On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
> > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > > > anything when changing the pte to be _more_ permissive, and I don't
> > > 
> > > Note that in my patch the invalidate_pages in mprotect can be
> > > trivially switched to a mprotect_pages with proper params. This will
> > > prevent page faults completely in the secondary MMU (there will only
> > > be tlb misses after the tlb flush just like for the core linux pte),
> > > and it'll allow all the secondary MMU pte blocks (512/1024 at time
> > > with my PT lock design) to be updated to have proper permissions
> > > matching the core linux pte.
> > 
> > Sorry, I realise I still didn't get this through my head yet (and also
> > have not seen your patch recently). So I don't know exactly what you
> > are doing...
> > 
> > But why does _anybody_ (why does Christoph's patches) need to invalidate
> > when they are going to be more permissive? This should be done lazily by
> > the driver, I would have thought.
> 
> 
> Agree. Although for most real applications, the performance difference
> is probably negligible.

But importantly, doing it that way means you share test coverage with
the CPU TLB flushing code, and you don't introduce a new concept to the
VM.

So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a
good reason otherwise). mprotect shouldn't really be a special case,
because it still has to flush the CPU tlbs as well when restricting
access.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-20 Thread Nick Piggin

On Wednesday 20 February 2008 20:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote:
> > On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> > > For XPMEM, we do not currently allow file backed
> > > mapping pages from being exported so we should never reach this
> > > condition. It has been an issue since day 1.  We have operated with
> > > that assumption for 6 years and have not had issues with that
> > > assumption.  The user of xpmem is MPT and it controls the communication
> > > buffers so it is reasonable to expect this type of behavior.
> >
> > OK, that makes things simpler.
> >
> > So why can't you export a device from your xpmem driver, which
> > can be mmap()ed to give out "anonymous" memory pages to be used
> > for these communication buffers?
>
> Because we need to have heap and stack available as well.  MPT does
> not control all the communication buffer areas.  I haven't checked, but
> this is the same problem that IB will have.  I believe they are actually
> allowing any memory region be accessible, but I am not sure of that.

Then you should create a driver that the user program can register
and unregister regions of their memory with. The driver can do a
get_user_pages to get the pages, and then you'd just need to set up
some kind of mapping so that userspace can unmap pages / won't leak
memory (and an exit_mm notifier I guess).

Because you don't need to swap, you don't need coherency, and you
are in control of the areas, then this seems like the best choice.
It would allow you to use heap, stack, file-backed, anything.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-20 Thread Nick Piggin

On Wednesday 20 February 2008 20:00, Robin Holt wrote:
 On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote:
  On Wednesday 20 February 2008 14:12, Robin Holt wrote:
   For XPMEM, we do not currently allow file backed
   mapping pages from being exported so we should never reach this
   condition. It has been an issue since day 1.  We have operated with
   that assumption for 6 years and have not had issues with that
   assumption.  The user of xpmem is MPT and it controls the communication
   buffers so it is reasonable to expect this type of behavior.
 
  OK, that makes things simpler.
 
  So why can't you export a device from your xpmem driver, which
  can be mmap()ed to give out anonymous memory pages to be used
  for these communication buffers?

 Because we need to have heap and stack available as well.  MPT does
 not control all the communication buffer areas.  I haven't checked, but
 this is the same problem that IB will have.  I believe they are actually
 allowing any memory region be accessible, but I am not sure of that.

Then you should create a driver that the user program can register
and unregister regions of their memory with. The driver can do a
get_user_pages to get the pages, and then you'd just need to set up
some kind of mapping so that userspace can unmap pages / won't leak
memory (and an exit_mm notifier I guess).

Because you don't need to swap, you don't need coherency, and you
are in control of the areas, then this seems like the best choice.
It would allow you to use heap, stack, file-backed, anything.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin

On Tue, Feb 19, 2008 at 05:40:50PM -0600, Jack Steiner wrote:
 On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
  On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
   On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
anything when changing the pte to be _more_ permissive, and I don't
   
   Note that in my patch the invalidate_pages in mprotect can be
   trivially switched to a mprotect_pages with proper params. This will
   prevent page faults completely in the secondary MMU (there will only
   be tlb misses after the tlb flush just like for the core linux pte),
   and it'll allow all the secondary MMU pte blocks (512/1024 at time
   with my PT lock design) to be updated to have proper permissions
   matching the core linux pte.
  
  Sorry, I realise I still didn't get this through my head yet (and also
  have not seen your patch recently). So I don't know exactly what you
  are doing...
  
  But why does _anybody_ (why does Christoph's patches) need to invalidate
  when they are going to be more permissive? This should be done lazily by
  the driver, I would have thought.
 
 
 Agree. Although for most real applications, the performance difference
 is probably negligible.

But importantly, doing it that way means you share test coverage with
the CPU TLB flushing code, and you don't introduce a new concept to the
VM.

So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a
good reason otherwise). mprotect shouldn't really be a special case,
because it still has to flush the CPU tlbs as well when restricting
access.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin

On Wed, Feb 20, 2008 at 02:09:41AM +0100, Andrea Arcangeli wrote:
 On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
  Sorry, I realise I still didn't get this through my head yet (and also
  have not seen your patch recently). So I don't know exactly what you
  are doing...
 
 The last version was posted here:
 
 http://marc.info/?l=kvm-develm=120321732521533w=2
 
  But why does _anybody_ (why does Christoph's patches) need to invalidate
  when they are going to be more permissive? This should be done lazily by
  the driver, I would have thought.
 
 This can be done lazily by the driver yes. The place where I've an
 invalidate_pages in mprotect however can also become less permissive.

That's OK, because we have to flush tlbs there too.


 It's simpler to invalidate always and it's not guaranteed the
 secondary mmu page fault is capable of refreshing the spte across a
 writeprotect fault.

I think we just have to make sure that it _can_ do writeprotect
faults. AFAIKS, that will be possible if the driver registers a
.page_mkwrite handler (actually not quite -- page_mkwrite is fairly
crap, so I have a patch to merge it together with .fault so we get
address information as well). Anyway, I really think we should do
it that way.

 In the future this can be changed to
 mprotect_pages though, so no page fault will happen in the secondary
 mmu.

Possibly, but hopefully not needed for performance. Let's wait and
see.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin

On Wed, Feb 20, 2008 at 11:39:42AM +0100, Andrea Arcangeli wrote:
 Given Nick's comments I ported my version of the mmu notifiers to
 latest mainline. There are no known bugs AFIK and it's obviously safe
 (nothing is allowed to schedule inside rcu_read_lock taken by
 mmu_notifier() with my patch).

Thanks! Yes the seqlock you are using now ends up looking similar
to what I did and I couldn't find a hole in that either. So I
think this is going to work.

I do prefer some parts of my patch, however for everyone's sanity,
I think you should be the maintainer of the mmu notifiers, and I
will send you incremental changes that can be discussed more easily
that way (nothing major, mainly style and minor things).


 XPMEM simply can't use RCU for the registration locking if it wants to
 schedule inside the mmu notifier calls. So I guess it's better to add
 the XPMEM invalidate_range_end/begin/external-rmap as a whole
 different subsystem that will have to use a mutex (not RCU) to
 serialize, and at the same time that CONFIG_XPMEM will also have to
 switch the i_mmap_lock to a mutex. I doubt xpmem fits inside a
 CONFIG_MMU_NOTIFIER anymore, or we'll all run a bit slower because of
 it. It's really a call of how much we want to optimize the MMU
 notifier, by keeping things like RCU for the registration.

I agree: your coherent, non-sleeping mmu notifiers are pretty simple
and unintrusive. The sleeping version is fundamentally going to either
need to change VM locks, or be non-coherent, so I don't think there is
a question of making one solution fit everybody. So the sleeping /
xrmap patch should be kept either completely independent, or as an
add-on to this one.

I will post some suggestions to you when I get a chance.

 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin

On Wed, Feb 20, 2008 at 01:03:24PM +0100, Andrea Arcangeli wrote:
 If there's agreement that the VM should alter its locking from
 spinlock to mutex for its own good, then Christoph's
 one-config-option-fits-all becomes a lot more appealing (replacing RCU
 with a mutex in the mmu notifier list registration locking isn't my
 main worry and the non-sleeping-users may be ok to live with it).

Just from a high level view, in some cases we can just say that no we
aren't going to support this. And this may well be one of those cases.

The more constraints placed on the VM, the harder it becomes to
improve and adapt in future. And this seems like a pretty big restriction.
(especially if we can eg. work around it completely by having a special
purpose driver to get_user_pages on comm buffers as I suggested in the
other mail).

At any rate, I believe Andrea's patch really places minimal or no further
constraints than a regular CPU TLB (or the hash tables that some archs
implement). So we're kind of in 2 different leagues here.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin

On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> For XPMEM, we do not currently allow file backed
> mapping pages from being exported so we should never reach this condition.
> It has been an issue since day 1.  We have operated with that assumption
> for 6 years and have not had issues with that assumption.  The user of
> xpmem is MPT and it controls the communication buffers so it is reasonable
> to expect this type of behavior.

OK, that makes things simpler.

So why can't you export a device from your xpmem driver, which
can be mmap()ed to give out "anonymous" memory pages to be used
for these communication buffers?

I guess you may also want an "munmap/mprotect" callback, which
we don't have in the kernel right now... but at least you could
prototype it easily by having an ioctl to be called before
munmapping or mprotecting (eg. the ioctl could prevent new TLB
setup for the region, and shoot down existing ones).

This is actually going to be much faster for you if you use any
threaded applications, because you will be able to do all the
shootdown round trips outside mmap_sem, and so you will be able
to have other threads faulting and even mmap()ing / munmaping
at the same time as the shootdown is happening.

I guess there is some catch...

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin

On Wednesday 20 February 2008 14:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:

> > > Also, how to you resolve the case where you are not allowed to sleep?
> > > I would have thought either you have to handle it, in which case nobody
> > > needs to sleep; or you can't handle it, in which case the code is
> > > broken.
> >
> > I also asked exactly this, glad you reasked this too.
>
> Currently, we BUG_ON having a PFN in our tables and not being able
> to sleep.  These are mappings which MPT has never supported in the past
> and XPMEM was already not allowing page faults for VMAs which are not
> anonymous so it should never happen.  If the file-backed operations can
> ever get changed to allow for sleeping and a customer has a need for it,
> we would need to change XPMEM to allow those types of faults to succeed.

Do you really want to be able to swap, or are you just interested
in keeping track of unmaps / prot changes?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> These special additional callbacks are required because XPmem (and likely
> other mechanisms) do use their own rmap (multiple processes on a series
> of remote Linux instances may be accessing the memory of a process).
> F.e. XPmem may have to send out notifications to remote Linux instances
> and receive confirmation before a page can be freed.
>
> So we handle this like an additional Linux reverse map that is walked after
> the existing rmaps have been walked. We leave the walking to the driver
> that is then able to use something else than a spinlock to walk its reverse
> maps. So we can actually call the driver without holding spinlocks while we
> hold the Pagelock.

I don't know how this is supposed to solve anything. The sleeping
problem happens I guess mostly in truncate. And all you are doing
is putting these rmap callbacks in page_mkclean and try_to_unmap.

> However, we cannot determine the mm_struct that a page belongs to at
> that point. The mm_struct can only be determined from the rmaps by the
> device driver.
>
> We add another pageflag (PageExternalRmap) that is set if a page has
> been remotely mapped (f.e. by a process from another Linux instance).
> We can then only perform the callbacks for pages that are actually in
> remote use.
>
> Rmap notifiers need an extra page bit and are only available
> on 64 bit platforms. This functionality is not available on 32 bit!
>
> A notifier that uses the reverse maps callbacks does not need to provide
> the invalidate_page() method that is called when locks are held.

That doesn't seem right. To start with, the new callbacks aren't
even called in the places where invalidate_page isn't allowed to
sleep.

The problem is unmap_mapping_range, right? And unmap_mapping_range
must walk the rmaps with the mmap lock held, which is why it can't
sleep. And it can't hold any mmap_sem so it cannot prevent address
space modifications of the processes in question between the time
you unmap them from the linux ptes with unmap_mapping_range, and the
time that you unmap them from your driver.

So in the meantime, you could have eg. a fault come in and set up a
new page for one of the processes, and that page might even get
exported via the same external driver. And now you have a totally
inconsistent view.

Preventing new mappings from being set up until the old mapping is
completely flushed is basically what we need to ensure for any sane
TLB as far as I can tell. To do that, you'll need to make the mmap
lock sleep, and either take mmap_sem inside it (which is a
deadlock condition at the moment), or make ptl sleep as well. These
are simply the locks we use to prevent that from happening, so I
can't see how you can possibly hope to have a coherent TLB without
invalidating inside those locks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > anything when changing the pte to be _more_ permissive, and I don't
> 
> Note that in my patch the invalidate_pages in mprotect can be
> trivially switched to a mprotect_pages with proper params. This will
> prevent page faults completely in the secondary MMU (there will only
> be tlb misses after the tlb flush just like for the core linux pte),
> and it'll allow all the secondary MMU pte blocks (512/1024 at time
> with my PT lock design) to be updated to have proper permissions
> matching the core linux pte.

Sorry, I realise I still didn't get this through my head yet (and also
have not seen your patch recently). So I don't know exactly what you
are doing...

But why does _anybody_ (why does Christoph's patches) need to invalidate
when they are going to be more permissive? This should be done lazily by
the driver, I would have thought.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.

You can't sleep inside rcu_read_lock()!

I must say that for a patch that is up to v8 or whatever and is
posted twice a week to such a big cc list, it is kind of slack to
not even test it and expect other people to review it.

Also, what we are going to need here are not skeleton drivers
that just do all the *easy* bits (of registering their callbacks),
but actual fully working examples that do everything that any
real driver will need to do. If not for the sanity of the driver
writer, then for the sanity of the VM developers (I don't want
to have to understand xpmem or infiniband in order to understand
how the VM works).



> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.
>
> Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
> Signed-off-by: Robin Holt <[EMAIL PROTECTED]>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> ---
>  mm/filemap_xip.c |5 +
>  mm/fremap.c  |3 +++
>  mm/hugetlb.c |3 +++
>  mm/memory.c  |   35 +--
>  mm/mmap.c|2 ++
>  mm/mprotect.c|3 +++
>  mm/mremap.c  |7 ++-
>  7 files changed, 51 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/mm/fremap.c
> ===
> --- linux-2.6.orig/mm/fremap.c2008-02-14 18:43:31.0 -0800
> +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.0 -0800
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
>   spin_unlock(>i_mmap_lock);
>   }
>
> + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
>   err = populate_range(mm, vma, start, size, pgoff);
> + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
>   if (!err && !(flags & MAP_NONBLOCK)) {
>   if (unlikely(has_write_lock)) {
>   downgrade_write(>mmap_sem);
> Index: linux-2.6/mm/memory.c
> ===
> --- linux-2.6.orig/mm/memory.c2008-02-14 18:43:31.0 -0800
> +++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.0 -0800
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
>   if (is_vm_hugetlb_page(vma))
>   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
> +
>   dst_pgd = pgd_offset(dst_mm, addr);
>   src_pgd = pgd_offset(src_mm, addr);
>   do {
> @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
>   vma, addr, next))
>   return -ENOMEM;
>   } while (dst_pgd++, src_pgd++, addr = next, addr != end);
> +
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_end, src_mm,
> + vma->vm_start, end, 0);
> +
>   return 0;
>  }
>
> @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
>   struct mmu_gather *tlb;
>   unsigned long end = address + size;
>   unsigned long nr_accounted = 0;
> + int atomic = details ? (details->i_mmap_lock != 0) : 0;
>
>   lru_add_drain();
>   tlb = tlb_gather_mmu(mm, 0);
>   update_hiwater_rss(mm);
> + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
>   end = unmap_vmas(, vma, address, end, _accounted, details);
>   if (tlb)
>   tlb_finish_mmu(tlb, address, end);
> + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
>   return end;
>  }
>

Where do you invalidate for munmap()?

Also, how to you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote:
> > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > > understand the need for invalidate_begin/invalidate_end pairs at all.
> > 
> > The need of the pairs is crystal clear to me: range_begin is needed
> > for GRU _but_only_if_ range_end is called after releasing the
> > reference that the VM holds on the page. _begin will flush the GRU tlb
> > and at the same time it will take a mutex that will block further GRU
> > tlb-miss-interrupts (no idea how they manange those nightmare locking,
> > I didn't even try to add more locking to KVM and I get away with the
> > fact KVM takes the pin on the page itself).
> 
> As it turns out, no actual mutex is required. _begin_ simply increments a
> count of active range invalidates, _end_ decrements the count. New TLB
> dropins are deferred while range callouts are active.
> 
> This would appear to be racy but the GRU has special hardware that
> simplifies locking. When the GRU sees a TLB invalidate, all outstanding
> misses & potentially inflight TLB dropins are marked by the GRU with a
> "kill" bit. When the dropin finally occurs, the dropin is ignored & the
> instruction is simply restarted. The instruction will fault again & the TLB
> dropin will be repeated.  This is optimized for the case where invalidates
> are rare - true for users of the GRU.

OK (thanks to Robin as well). Now I understand why you are using it,
but I don't understand why you don't defer new TLBs after the point
where the linux pte changes. If you can do that, then you look and
act much more like a TLB from the point of view of the Linux vm.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > are rather similar. However I have tried to make a point of minimising the
> > impact the the core mm/. I don't see why we need to invalidate or flush
> 
> I also tried hard to minimise the impact of the core mm/, I also
> argued with Christoph that cluttering mm/ wasn't a good idea for
> things like age_page that could be a 1 liner change instead of a
> multiple-liner change, without any loss of flexibility or readability.
> 
> > anything when changing the pte to be _more_ permissive, and I don't
> 
> Note that in my patch the invalidate_pages in mprotect can be
> trivially switched to a mprotect_pages with proper params. This will
> prevent page faults completely in the secondary MMU (there will only
> be tlb misses after the tlb flush just like for the core linux pte),
> and it'll allow all the secondary MMU pte blocks (512/1024 at time
> with my PT lock design) to be updated to have proper permissions
> matching the core linux pte.
> 
> > understand the need for invalidate_begin/invalidate_end pairs at all.
> 
> The need of the pairs is crystal clear to me: range_begin is needed
> for GRU _but_only_if_ range_end is called after releasing the
> reference that the VM holds on the page. _begin will flush the GRU tlb
> and at the same time it will take a mutex that will block further GRU
> tlb-miss-interrupts (no idea how they manange those nightmare locking,
> I didn't even try to add more locking to KVM and I get away with the
> fact KVM takes the pin on the page itself).
> 
> My patch calls invalidate_page/pages before the reference is released
> on the page, so GRU will work fine despite lack of
> range_begin. Furthermore with my patch GRU will be auto-serialized by
> the PT lock w/o the need of any additional locking.

That's why I don't understand the need for the pairs: it should be
done like this.


> > What I have done is basically create it so that the notifiers get called
> > basically in the same place as the normal TLB flushing is done, and nowhere
> > else.
> 
> That was one of my objectives too.
> 
> > I also wanted to avoid calling notifier code from inside eg. hardware TLB
> > or pte manipulation primitives. These things are already pretty well
> > spaghetti, so I'd like to just place them right where needed first... I
> > think eventually it will need a bit of a rethink to make it more consistent
> > and more general. But I prefer to do put them in the caller for the moment.
> 
> Your patch should also work for KVM but it's suboptimal, my patch can
> be orders of magnitude more efficient for GRU thanks to the
> invalidate_pages optimization. Christoph complained about having to
> call one method per pte.

OK, I didn't see the invalidate_pages call...

 
> And adding invalidate_range is useless unless you fully support
> xpmem. You're calling invalidate_range in places that can't sleep...

I thought that could be used by a non-sleeping user (not intending
to try supporting sleeping users). If it is useless then it should
go away (BTW. I didn't see your recent patch, some of my confusion
I think stems from Christoph's novel way of merging and splitting
patches).


> No idea why xpmem needs range_begin, I perfectly understand why GRU
> needs _begin with Chrisotph's patch (gru lacks the page pin) but I
> dunno why xpmem needs range_begin (xpmem has the page pin so I also
> think it could avoid using range_begin). Still to support GRU you need
> both to call invalidate_range in places that can sleep and you need
> the external rmap notifier. The moment you add xpmem into the equation
> your and my clean patches become Christoph's one...

Sorry, I kind of didn't have time to follow the conversation so well
before; are there patches posted for gru and/or xpmem?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin

On Tuesday 19 February 2008 20:57, Andi Kleen wrote:
> On Tue, Feb 19, 2008 at 08:46:46PM +1100, Nick Piggin wrote:

> > I think it was just a simple context switch benchmark, but not lmbench
> > (which I found to be a bit too variable). But it was a long time ago...
>
> Do you still have it?
>
> I thought about writing my own but ended up being too lazy for that @)

Had a quick look but couldn't find it. It was just two threads running
and switching to each other with a couple of mutexes or yield. If I
find it, then I'll send it over.

> > > > Actually one thing I don't like about gcc is that I think it still
> > > > emits cmovs for likely/unlikely branches,
> > >
> > > That's -Os.
> >
> > And -O2 and -O3, on the gccs that I'm using, AFAIKS.
>
> Well if it still happens on gcc 4.2 with P4 tuning you should
> perhaps open a gcc PR. They tend to ignore these bugs mostly in
> my experience, but sometimes they act on them.

I'm not sure about P4 tuning... But even IMO it should not on
predictable branches too much for any (especially OOOE) CPU.

> > > > which is silly (the gcc developers
> > >
> > > It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
> > > makes sense. P4 doesn't like it though.
> >
> > If the branch is completely predictable (eg. annotated), then I
> > think branches should be used anyway. Even on well predicted
> > branches, cmov is similar speed on microbenchmarks, but it will
> > increase data hazards I think, so it will probably be worse for
> > some real world situations.
>
> At least the respective optimization manuals say they should be used.
> I presume they only made this recommendation after some extensive
> benchmarking.

What I have seen is that they tell you definitely not to use it for
predictable branches. Eg. the Intel optimization manual says

 Use the setcc and cmov instructions to eliminate unpredictable
 conditional branches where possible. Do not do this for predictable
 branches. Do not use these instructions to eliminate all
 unpredictable conditional branches, because using these instructions
 will incur execution overhead due to executing both paths of a
 conditional branch. In addition, converting conditional branches to
 cmovs or setcc trades control-flow dependence for data dependence
 and restricts the capability of the out-of-order engine.

> > But a likely branch will be _strongly_ predicted to be taken,
> > wheras a lot of the gcc heuristics simply have slightly more or
> > slightly less probability. So it's not just a question of which
> > way is more likely, but also _how_ likely it is to go that way.
>
> Yes, but a lot of the heuristics are pretty strong (>80%) and gcc will
> act on them unless it has a very strong contra cue. And that should
> normally not be the case.

True, but if you know a branch is 99%+, then use of likely/unlikely
can still be a good idea. 80% may not be enough to choose a branch
over a cmov for example.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin

On Tuesday 19 February 2008 20:25, Andi Kleen wrote:
> On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:

> > I actually once measured context switching performance in the scheduler,
> > and removing the  unlikely hint for testing RT tasks IIRC gave about 5%
> > performance drop.
>
> OT: what benchmarks did you use for that? I had a change some time
> ago to the CFS scheduler to avoid unpredicted indirect calls for
> the common case, but I wasn't able to benchmark a difference with the usual
> suspect benchmark (lmbench). Since it increased code size by
> a few bytes it was rejected then.

I think it was just a simple context switch benchmark, but not lmbench
(which I found to be a bit too variable). But it was a long time ago...

> > This was on a P4 which is very different from more modern CPUs both in
> > terms of branch performance characteristics,
> >
> > and icache characteristics.
>
> Hmm, the P4 the trace cache actually should not care about inline
> code that is not executed.

Yeah, which is why it is a bit different than other CPUs. Although
the L2 cache I guess is still going to suffer from sparse code, but
I guess that is a bit less important.

> > However, the P4's branch predictor is pretty good, and it should easily
>
> I think it depends on the generation. Prescott class branch
> prediction should be much better than the earlier ones.

I was using a Nocona Xeon, which I think is a Prescott class? And
don't they have much higher mispredict penalty (than older P4s)?

> > Actually one thing I don't like about gcc is that I think it still emits
> > cmovs for likely/unlikely branches,
>
> That's -Os.

And -O2 and -O3, on the gccs that I'm using, AFAIKS.

> > which is silly (the gcc developers
>
> It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
> makes sense. P4 doesn't like it though.

If the branch is completely predictable (eg. annotated), then I
think branches should be used anyway. Even on well predicted
branches, cmov is similar speed on microbenchmarks, but it will
increase data hazards I think, so it will probably be worse for
some real world situations.

> > the quite good numbers that cold CPU predictors can attain. However
> > for really performance critical code (or really "never" executed
> > code), then I think it is OK to have the hints and not have to rely
> > on gcc heuristics.
>
> But only when the explicit hints are different from what the implicit
> branch predictors would predict anyways. And if you look at the
> heuristics that is not often the case...

But a likely branch will be _strongly_ predicted to be taken,
wheras a lot of the gcc heuristics simply have slightly more or
slightly less probability. So it's not just a question of which
way is more likely, but also _how_ likely it is to go that way.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.
>
> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.

This whole thing would be much better if you didn't rely on the page
lock at all, but either a) used the same locking as Linux does for its
ptes/tlbs, or b) have some locking that is private to the mmu notifier
code. Then there is not all this new stuff that has to be understood in
the core VM.

Also, why do you have to "invalidate" ranges when switching to a
_more_ permissive state? This stuff should basically be the same as
(a subset of) the TLB flushing API AFAIKS. Anything more is a pretty
big burden to put in the core VM.

See my alternative patch I posted -- I can't see why it won't work
just like a TLB.

As far as sleeping inside callbacks goes... I think there are big
problems with the patch (the sleeping patch and the external rmap
patch). I don't think it is workable in its current state. Either
we have to make some big changes to the core VM, or we have to turn
some locks into sleeping locks to do it properly AFAIKS. Neither
one is good.

But anyway, I don't really think the two approaches (Andrea's
notifiers vs sleeping/xrmap) should be tangled up too much. I
think Andrea's can possibly be quite unintrusive and useful very
soon.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-19 Thread Nick Piggin

On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:

> > >   flush_cache_page(vma, address, pte_pfn(*pte));
> > >   entry = ptep_clear_flush(vma, address, pte);
> > > + mmu_notifier(invalidate_page, mm, address);
> >
> > I just don't see how ths can be done if the callee has another thread in
> > the middle of establishing IO against this region of memory.
> > ->invalidate_page() _has_ to be able to block.  Confused.
>
> The page lock is held and that holds off I/O?

I think the actual answer is that "it doesn't matter".

ptes are not exactly the entity via which IO gets established, so
all we really care about here is that after the callback finishes,
we will not get any more reads or writes to the page via the
external mapping.

As far as holding off local IO goes, that is the job of the core
VM. (And no, page lock does not necessarily hold it off FYI -- it
can be writeback IO or even IO directly via buffers).

Holding off IO via the external references I guess is a job for
the notifier driver.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] my mmu notifier sample driver

2008-02-19 Thread Nick Piggin


Index: linux-2.6/drivers/char/mmu_notifier_skel.c
===
--- /dev/null
+++ linux-2.6/drivers/char/mmu_notifier_skel.c
@@ -0,0 +1,255 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static DEFINE_SPINLOCK(mmn_lock);
+static RADIX_TREE(rmap_tree, GFP_ATOMIC);
+static seqcount_t rmap_seq = SEQCNT_ZERO;
+
+static int __rmap_add(unsigned long mem, unsigned long vaddr)
+{
+   int err;
+
+   err = radix_tree_insert(_tree, mem >> PAGE_SHIFT, (void *)vaddr);
+
+   return err;
+}
+
+static void __rmap_del(unsigned long mem)
+{
+   void *ret;
+
+   ret = radix_tree_delete(_tree, mem >> PAGE_SHIFT);
+   BUG_ON(!ret);
+}
+
+static unsigned long rmap_find(unsigned long mem)
+{
+   unsigned long vaddr;
+
+   rcu_read_lock();
+   vaddr = (unsigned long)radix_tree_lookup(_tree, mem >> PAGE_SHIFT);
+   rcu_read_unlock();
+
+   return vaddr;
+}
+
+static struct page *follow_page_atomic(struct mm_struct *mm, unsigned long 
address, int write)
+{
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, address);
+if (!vma || (vma->vm_start > address))
+return NULL;
+
+   if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+   return NULL;
+
+   return follow_page(vma, address, FOLL_GET|(write ? FOLL_WRITE : 0));
+}
+
+static int mmn_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long source_vaddr = (unsigned long)vmf->pgoff << PAGE_SHIFT;
+   unsigned long dest_vaddr = (unsigned long)vmf->virtual_address;
+   unsigned long pfn;
+   struct page *page;
+   pgprot_t prot;
+   int write = vmf->flags & FAULT_FLAG_WRITE;
+   int ret;
+
+   printk("mmn_vm_fault [EMAIL PROTECTED] sourcing from %lx\n", write ? 
"write" : "read", dest_vaddr, source_vaddr);
+
+   BUG_ON(mm != current->mm); /* disallow get_user_pages */
+
+again:
+   spin_lock(_lock);
+   write_seqcount_begin(_seq);
+   page = follow_page_atomic(mm, source_vaddr, write);
+   if (unlikely(!page)) {
+   write_seqcount_end(_seq);
+   spin_unlock(_lock);
+   ret = get_user_pages(current, mm, source_vaddr,
+   1, write, 0, , NULL);
+   if (ret != 1)
+   goto out_err;
+   put_page(page);
+   goto again;
+   }
+
+   ret = __rmap_add(source_vaddr, dest_vaddr);
+   if (ret)
+   goto out_lock;
+
+   pfn = page_to_pfn(page);
+   prot = vma->vm_page_prot;
+   if (!write)
+   vma->vm_page_prot = vm_get_page_prot(vma->vm_flags & 
~(VM_WRITE|VM_MAYWRITE));
+   ret = vm_insert_pfn(vma, dest_vaddr, pfn);
+   vma->vm_page_prot = prot;
+   if (ret) {
+   if (ret == -EBUSY)
+   WARN_ON(1);
+   goto out_rmap;
+   }
+   write_seqcount_end(_seq);
+   spin_unlock(_lock);
+   put_page(page);
+
+return VM_FAULT_NOPAGE;
+
+out_rmap:
+   __rmap_del(source_vaddr);
+out_lock:
+   write_seqcount_end(_seq);
+   spin_unlock(_lock);
+   put_page(page);
+out_err:
+   switch (ret) {
+   case -EFAULT:
+   case -EEXIST:
+   case -EBUSY:
+   return VM_FAULT_SIGBUS;
+   case -ENOMEM:
+   return VM_FAULT_OOM;
+   default:
+   BUG();
+   }
+}
+
+struct vm_operations_struct mmn_vm_ops = {
+.fault = mmn_vm_fault,
+};
+
+static int mmu_notifier_busy;
+static struct mmu_notifier mmu_notifier;
+
+static int mmn_clear_young(struct mmu_notifier *mn, unsigned long address)
+{
+   unsigned long vaddr;
+   unsigned seq;
+   struct mm_struct *mm = mn->mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep, pte;
+
+   do {
+   seq = read_seqcount_begin(_seq);
+   vaddr = rmap_find(address);
+   } while (read_seqcount_retry(_seq, seq));
+
+   if (vaddr == 0)
+   return 0;
+
+   printk("[EMAIL PROTECTED] sourced from %lx\n", vaddr, address);
+
+   spin_lock(_lock);
+pgd = pgd_offset(mm, vaddr);
+pud = pud_offset(pgd, vaddr);
+   if (pud) {
+   pmd = pmd_offset(pud, vaddr);
+   if (pmd) {
+   ptep = pte_offset_map(pmd, vaddr);
+   if (ptep) {
+   pte = *ptep;
+   if (!pte_present(pte)) {
+   /* x86 specific, don't have a vma */
+   ptep_get_and_clear(mm, vaddr, ptep);
+   __flush_tlb_one(vaddr);
+   }
+   pte_unmap(ptep);
+

[patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

Well I started reviewing the mmu notifier code, but it is kind of hard to
know what you're talking about just by reading through code and not trying
your suggestions for yourself...

So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers
are rather similar. However I have tried to make a point of minimising the
impact the the core mm/. I don't see why we need to invalidate or flush
anything when changing the pte to be _more_ permissive, and I don't
understand the need for invalidate_begin/invalidate_end pairs at all.
What I have done is basically create it so that the notifiers get called
basically in the same place as the normal TLB flushing is done, and nowhere
else.

I also wanted to avoid calling notifier code from inside eg. hardware TLB
or pte manipulation primitives. These things are already pretty well
spaghetti, so I'd like to just place them right where needed first... I
think eventually it will need a bit of a rethink to make it more consistent
and more general. But I prefer to do put them in the caller for the moment.

I have also attempted to write a skeleton driver. Not like Christoph's
drivers, but one that actually does something. This one can mmap a
window into its own virtual address space. It's not perfect yet (I need
to replace page_mkwrite with ->fault in the core mm before I can get
enough information to do protection properly I think). However I think it
may be race-free in the fault vs unmap paths. It's pretty complex, I must
say.

---

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -228,6 +228,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- /dev/null
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -0,0 +1,69 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include 
+#include 
+
+struct mmu_notifier;
+struct mmu_notifier_operations;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier {
+   struct hlist_node hlist;
+   const struct mmu_notifier_operations *ops;
+   struct mm_struct *mm;
+};
+
+struct mmu_notifier_operations {
+   void (*release)(struct mmu_notifier *mn);
+   int (*clear_young)(struct mmu_notifier *mn, unsigned long address);
+   void (*unmap)(struct mmu_notifier *mn, unsigned long address);
+   void (*invalidate_range)(struct mmu_notifier *mn, unsigned long start, 
unsigned long end);
+};
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+   INIT_HLIST_HEAD(>mmu_notifier_list);
+}
+
+static inline void mmu_notifier_init(struct mmu_notifier *mn, const struct 
mmu_notifier_operations *ops, struct mm_struct *mm)
+{
+   INIT_HLIST_NODE(>hlist);
+   mn->ops = ops;
+   mn->mm = mm;
+}
+
+extern void mmu_notifier_register(struct mmu_notifier *mn);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn);
+
+extern void mmu_notifier_exit_mm(struct mm_struct *mm);
+extern int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long 
address);
+extern void mmu_notifier_unmap(struct mm_struct *mm, unsigned long address);
+extern void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long 
start, unsigned long end);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void mmu_notifier_exit_mm(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long 
address)
+{
+   return 0;
+}
+
+static inline void mmu_notifier_unmap(struct mm_struct *mm, unsigned long 
address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, 
unsigned long start, unsigned long end)
+{
+}
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif
Index: linux-2.6/kernel/fork.c
===
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -358,6 +359,7 @@ static struct mm_struct * mm_init(struct
mm->ioctx_list = NULL;
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+   mmu_notifier_init_mm(mm);
mm_init_cgroup(mm, p);
 
if (likely(!mm_alloc_pgd(mm))) {
Index: linux-2.6/mm/filemap_xip.c
===
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -195,6 +195,7 @@ __xip_unmap (struct address_space * mapp
/* Nuke the page table entry. */
flush_cache_page(vma, address,

Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-19 Thread Nick Piggin

On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
> background
> 
> current VM implementation doesn't has limit of # of parallel reclaim.
> when heavy workload, it bring to 2 bad things
>   - heavy lock contention
>   - unnecessary swap out
>
> abount 2 month ago, KAMEZA Hiroyuki proposed the patch of page
> reclaim throttle and explain it improve reclaim time.
>   http://marc.info/?l=linux-mm=119667465917215=2
>
> but unfortunately it works only memcgroup reclaim.
> Today, I implement it again for support global reclaim and mesure it.
>
>
> test machine, method and result
> ==
> 
>   CPU:  IA64 x8
>   MEM:  8GB
>   SWAP: 2GB
>
> 
>   got hackbench from
>   http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
>
>   $ /usr/bin/time hackbench 120 process 1000
>
>   this parameter mean consume all physical memory and
>   1GB swap space on my test environment.
>
> 
>
> before:
>   hackbench result:   282.30
>   /usr/bin/time result
>   user:   14.16
>   sys:1248.47
>   elapse: 432.93
>   major fault:29026
>   max parallel reclaim tasks: 1298
>   max consumption time of
>try_to_free_pages():   70394
>
> after:
>   hackbench result:   30.36
>   /usr/bin/time result
>   user:   14.26
>   sys:294.44
>   elapse: 118.01
>   major fault:3064
>   max parallel reclaim tasks: 4
>   max consumption time of
>try_to_free_pages():   12234
>
>
> conclusion
> =
> this patch improve 3 things.
> 1. reduce unnecessary swap
>(see above major fault. about 90% reduced)
> 2. improve throughput performance
>(see above hackbench result. about 90% reduced)
> 3. improve interactive performance.
>(see above max consumption of try_to_free_pages.
> about 80% reduced)
> 4. reduce lock contention.
>(see above sys time. about 80% reduced)
>
>
> Now, we got about 1000% performance improvement of hackbench :)
>
>
>
> foture works
> ==
>  - more discussion with memory controller guys.

Hi,

Yeah this is definitely needed and a nice result.

I'm worried about a) placing a global limit on parallelism, and b)
placing a limit on parallelism at all.

I think it should maybe be a per-zone thing...

What happens if you make it a per-zone mutex, and allow just a single
process to reclaim pages from a given zone at a time? I guess that is
going to slow down throughput a little bit in some cases though...

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-19 Thread Nick Piggin

On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
 background
 
 current VM implementation doesn't has limit of # of parallel reclaim.
 when heavy workload, it bring to 2 bad things
   - heavy lock contention
   - unnecessary swap out

 abount 2 month ago, KAMEZA Hiroyuki proposed the patch of page
 reclaim throttle and explain it improve reclaim time.
   http://marc.info/?l=linux-mmm=119667465917215w=2

 but unfortunately it works only memcgroup reclaim.
 Today, I implement it again for support global reclaim and mesure it.


 test machine, method and result
 ==
 test machine
   CPU:  IA64 x8
   MEM:  8GB
   SWAP: 2GB

 test method
   got hackbench from
   http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

   $ /usr/bin/time hackbench 120 process 1000

   this parameter mean consume all physical memory and
   1GB swap space on my test environment.

 test result (average of 3 times measurement)

 before:
   hackbench result:   282.30
   /usr/bin/time result
   user:   14.16
   sys:1248.47
   elapse: 432.93
   major fault:29026
   max parallel reclaim tasks: 1298
   max consumption time of
try_to_free_pages():   70394

 after:
   hackbench result:   30.36
   /usr/bin/time result
   user:   14.26
   sys:294.44
   elapse: 118.01
   major fault:3064
   max parallel reclaim tasks: 4
   max consumption time of
try_to_free_pages():   12234


 conclusion
 =
 this patch improve 3 things.
 1. reduce unnecessary swap
(see above major fault. about 90% reduced)
 2. improve throughput performance
(see above hackbench result. about 90% reduced)
 3. improve interactive performance.
(see above max consumption of try_to_free_pages.
 about 80% reduced)
 4. reduce lock contention.
(see above sys time. about 80% reduced)


 Now, we got about 1000% performance improvement of hackbench :)



 foture works
 ==
  - more discussion with memory controller guys.

Hi,

Yeah this is definitely needed and a nice result.

I'm worried about a) placing a global limit on parallelism, and b)
placing a limit on parallelism at all.

I think it should maybe be a per-zone thing...

What happens if you make it a per-zone mutex, and allow just a single
process to reclaim pages from a given zone at a time? I guess that is
going to slow down throughput a little bit in some cases though...

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] my mmu notifier sample driver

2008-02-19 Thread Nick Piggin


Index: linux-2.6/drivers/char/mmu_notifier_skel.c
===
--- /dev/null
+++ linux-2.6/drivers/char/mmu_notifier_skel.c
@@ -0,0 +1,255 @@
+#include linux/types.h
+#include linux/kernel.h
+#include linux/module.h
+#include linux/init.h
+#include linux/miscdevice.h
+#include linux/slab.h
+#include linux/sched.h
+#include linux/mm.h
+#include linux/fs.h
+#include linux/mmu_notifier.h
+#include linux/radix-tree.h
+#include linux/seqlock.h
+#include asm/tlbflush.h
+
+static DEFINE_SPINLOCK(mmn_lock);
+static RADIX_TREE(rmap_tree, GFP_ATOMIC);
+static seqcount_t rmap_seq = SEQCNT_ZERO;
+
+static int __rmap_add(unsigned long mem, unsigned long vaddr)
+{
+   int err;
+
+   err = radix_tree_insert(rmap_tree, mem  PAGE_SHIFT, (void *)vaddr);
+
+   return err;
+}
+
+static void __rmap_del(unsigned long mem)
+{
+   void *ret;
+
+   ret = radix_tree_delete(rmap_tree, mem  PAGE_SHIFT);
+   BUG_ON(!ret);
+}
+
+static unsigned long rmap_find(unsigned long mem)
+{
+   unsigned long vaddr;
+
+   rcu_read_lock();
+   vaddr = (unsigned long)radix_tree_lookup(rmap_tree, mem  PAGE_SHIFT);
+   rcu_read_unlock();
+
+   return vaddr;
+}
+
+static struct page *follow_page_atomic(struct mm_struct *mm, unsigned long 
address, int write)
+{
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, address);
+if (!vma || (vma-vm_start  address))
+return NULL;
+
+   if (vma-vm_flags  (VM_IO | VM_PFNMAP))
+   return NULL;
+
+   return follow_page(vma, address, FOLL_GET|(write ? FOLL_WRITE : 0));
+}
+
+static int mmn_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct mm_struct *mm = vma-vm_mm;
+   unsigned long source_vaddr = (unsigned long)vmf-pgoff  PAGE_SHIFT;
+   unsigned long dest_vaddr = (unsigned long)vmf-virtual_address;
+   unsigned long pfn;
+   struct page *page;
+   pgprot_t prot;
+   int write = vmf-flags  FAULT_FLAG_WRITE;
+   int ret;
+
+   printk(mmn_vm_fault [EMAIL PROTECTED] sourcing from %lx\n, write ? 
write : read, dest_vaddr, source_vaddr);
+
+   BUG_ON(mm != current-mm); /* disallow get_user_pages */
+
+again:
+   spin_lock(mmn_lock);
+   write_seqcount_begin(rmap_seq);
+   page = follow_page_atomic(mm, source_vaddr, write);
+   if (unlikely(!page)) {
+   write_seqcount_end(rmap_seq);
+   spin_unlock(mmn_lock);
+   ret = get_user_pages(current, mm, source_vaddr,
+   1, write, 0, page, NULL);
+   if (ret != 1)
+   goto out_err;
+   put_page(page);
+   goto again;
+   }
+
+   ret = __rmap_add(source_vaddr, dest_vaddr);
+   if (ret)
+   goto out_lock;
+
+   pfn = page_to_pfn(page);
+   prot = vma-vm_page_prot;
+   if (!write)
+   vma-vm_page_prot = vm_get_page_prot(vma-vm_flags  
~(VM_WRITE|VM_MAYWRITE));
+   ret = vm_insert_pfn(vma, dest_vaddr, pfn);
+   vma-vm_page_prot = prot;
+   if (ret) {
+   if (ret == -EBUSY)
+   WARN_ON(1);
+   goto out_rmap;
+   }
+   write_seqcount_end(rmap_seq);
+   spin_unlock(mmn_lock);
+   put_page(page);
+
+return VM_FAULT_NOPAGE;
+
+out_rmap:
+   __rmap_del(source_vaddr);
+out_lock:
+   write_seqcount_end(rmap_seq);
+   spin_unlock(mmn_lock);
+   put_page(page);
+out_err:
+   switch (ret) {
+   case -EFAULT:
+   case -EEXIST:
+   case -EBUSY:
+   return VM_FAULT_SIGBUS;
+   case -ENOMEM:
+   return VM_FAULT_OOM;
+   default:
+   BUG();
+   }
+}
+
+struct vm_operations_struct mmn_vm_ops = {
+.fault = mmn_vm_fault,
+};
+
+static int mmu_notifier_busy;
+static struct mmu_notifier mmu_notifier;
+
+static int mmn_clear_young(struct mmu_notifier *mn, unsigned long address)
+{
+   unsigned long vaddr;
+   unsigned seq;
+   struct mm_struct *mm = mn-mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep, pte;
+
+   do {
+   seq = read_seqcount_begin(rmap_seq);
+   vaddr = rmap_find(address);
+   } while (read_seqcount_retry(rmap_seq, seq));
+
+   if (vaddr == 0)
+   return 0;
+
+   printk([EMAIL PROTECTED] sourced from %lx\n, vaddr, address);
+
+   spin_lock(mmn_lock);
+pgd = pgd_offset(mm, vaddr);
+pud = pud_offset(pgd, vaddr);
+   if (pud) {
+   pmd = pmd_offset(pud, vaddr);
+   if (pmd) {
+   ptep = pte_offset_map(pmd, vaddr);
+   if (ptep) {
+   pte = *ptep;
+   if (!pte_present(pte)) {
+   /* x86 specific, don't have a vma */
+

[patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

Well I started reviewing the mmu notifier code, but it is kind of hard to
know what you're talking about just by reading through code and not trying
your suggestions for yourself...

So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers
are rather similar. However I have tried to make a point of minimising the
impact the the core mm/. I don't see why we need to invalidate or flush
anything when changing the pte to be _more_ permissive, and I don't
understand the need for invalidate_begin/invalidate_end pairs at all.
What I have done is basically create it so that the notifiers get called
basically in the same place as the normal TLB flushing is done, and nowhere
else.

I also wanted to avoid calling notifier code from inside eg. hardware TLB
or pte manipulation primitives. These things are already pretty well
spaghetti, so I'd like to just place them right where needed first... I
think eventually it will need a bit of a rethink to make it more consistent
and more general. But I prefer to do put them in the caller for the moment.

I have also attempted to write a skeleton driver. Not like Christoph's
drivers, but one that actually does something. This one can mmap a
window into its own virtual address space. It's not perfect yet (I need
to replace page_mkwrite with -fault in the core mm before I can get
enough information to do protection properly I think). However I think it
may be race-free in the fault vs unmap paths. It's pretty complex, I must
say.

---

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -228,6 +228,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- /dev/null
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -0,0 +1,69 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include linux/list.h
+#include linux/mm_types.h
+
+struct mmu_notifier;
+struct mmu_notifier_operations;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier {
+   struct hlist_node hlist;
+   const struct mmu_notifier_operations *ops;
+   struct mm_struct *mm;
+};
+
+struct mmu_notifier_operations {
+   void (*release)(struct mmu_notifier *mn);
+   int (*clear_young)(struct mmu_notifier *mn, unsigned long address);
+   void (*unmap)(struct mmu_notifier *mn, unsigned long address);
+   void (*invalidate_range)(struct mmu_notifier *mn, unsigned long start, 
unsigned long end);
+};
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+   INIT_HLIST_HEAD(mm-mmu_notifier_list);
+}
+
+static inline void mmu_notifier_init(struct mmu_notifier *mn, const struct 
mmu_notifier_operations *ops, struct mm_struct *mm)
+{
+   INIT_HLIST_NODE(mn-hlist);
+   mn-ops = ops;
+   mn-mm = mm;
+}
+
+extern void mmu_notifier_register(struct mmu_notifier *mn);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn);
+
+extern void mmu_notifier_exit_mm(struct mm_struct *mm);
+extern int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long 
address);
+extern void mmu_notifier_unmap(struct mm_struct *mm, unsigned long address);
+extern void mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long 
start, unsigned long end);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void mmu_notifier_exit_mm(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long 
address)
+{
+   return 0;
+}
+
+static inline void mmu_notifier_unmap(struct mm_struct *mm, unsigned long 
address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, 
unsigned long start, unsigned long end)
+{
+}
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif
Index: linux-2.6/kernel/fork.c
===
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -43,6 +43,7 @@
 #include linux/memcontrol.h
 #include linux/profile.h
 #include linux/rmap.h
+#include linux/mmu_notifier.h
 #include linux/acct.h
 #include linux/tsacct_kern.h
 #include linux/cn_proc.h
@@ -358,6 +359,7 @@ static struct mm_struct * mm_init(struct
mm-ioctx_list = NULL;
mm-free_area_cache = TASK_UNMAPPED_BASE;
mm-cached_hole_size = ~0UL;
+   mmu_notifier_init_mm(mm);
mm_init_cgroup(mm, p);
 
if (likely(!mm_alloc_pgd(mm))) {
Index: linux-2.6/mm/filemap_xip.c
===
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -195,6 +195,7 @@ __xip_unmap (struct

Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-19 Thread Nick Piggin

On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
 On Fri, 15 Feb 2008, Andrew Morton wrote:

 flush_cache_page(vma, address, pte_pfn(*pte));
 entry = ptep_clear_flush(vma, address, pte);
   + mmu_notifier(invalidate_page, mm, address);
 
  I just don't see how ths can be done if the callee has another thread in
  the middle of establishing IO against this region of memory.
  -invalidate_page() _has_ to be able to block.  Confused.

 The page lock is held and that holds off I/O?

I think the actual answer is that it doesn't matter.

ptes are not exactly the entity via which IO gets established, so
all we really care about here is that after the callback finishes,
we will not get any more reads or writes to the page via the
external mapping.

As far as holding off local IO goes, that is the job of the core
VM. (And no, page lock does not necessarily hold it off FYI -- it
can be writeback IO or even IO directly via buffers).

Holding off IO via the external references I guess is a job for
the notifier driver.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin

On Tuesday 19 February 2008 20:25, Andi Kleen wrote:
 On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:

  I actually once measured context switching performance in the scheduler,
  and removing the  unlikely hint for testing RT tasks IIRC gave about 5%
  performance drop.

 OT: what benchmarks did you use for that? I had a change some time
 ago to the CFS scheduler to avoid unpredicted indirect calls for
 the common case, but I wasn't able to benchmark a difference with the usual
 suspect benchmark (lmbench). Since it increased code size by
 a few bytes it was rejected then.

I think it was just a simple context switch benchmark, but not lmbench
(which I found to be a bit too variable). But it was a long time ago...


  This was on a P4 which is very different from more modern CPUs both in
  terms of branch performance characteristics,
 
  and icache characteristics.

 Hmm, the P4 the trace cache actually should not care about inline
 code that is not executed.

Yeah, which is why it is a bit different than other CPUs. Although
the L2 cache I guess is still going to suffer from sparse code, but
I guess that is a bit less important.


  However, the P4's branch predictor is pretty good, and it should easily

 I think it depends on the generation. Prescott class branch
 prediction should be much better than the earlier ones.

I was using a Nocona Xeon, which I think is a Prescott class? And
don't they have much higher mispredict penalty (than older P4s)?


  Actually one thing I don't like about gcc is that I think it still emits
  cmovs for likely/unlikely branches,

 That's -Os.

And -O2 and -O3, on the gccs that I'm using, AFAIKS.


  which is silly (the gcc developers

 It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
 makes sense. P4 doesn't like it though.

If the branch is completely predictable (eg. annotated), then I
think branches should be used anyway. Even on well predicted
branches, cmov is similar speed on microbenchmarks, but it will
increase data hazards I think, so it will probably be worse for
some real world situations.


  the quite good numbers that cold CPU predictors can attain. However
  for really performance critical code (or really never executed
  code), then I think it is OK to have the hints and not have to rely
  on gcc heuristics.

 But only when the explicit hints are different from what the implicit
 branch predictors would predict anyways. And if you look at the
 heuristics that is not often the case...

But a likely branch will be _strongly_ predicted to be taken,
wheras a lot of the gcc heuristics simply have slightly more or
slightly less probability. So it's not just a question of which
way is more likely, but also _how_ likely it is to go that way.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
 The invalidation of address ranges in a mm_struct needs to be
 performed when pages are removed or permissions etc change.

 If invalidate_range_begin() is called with locks held then we
 pass a flag into invalidate_range() to indicate that no sleeping is
 possible. Locks are only held for truncate and huge pages.

 In two cases we use invalidate_range_begin/end to invalidate
 single pages because the pair allows holding off new references
 (idea by Robin Holt).

 do_wp_page(): We hold off new references while we update the pte.

 xip_unmap: We are not taking the PageLock so we cannot
 use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
 stands in.

This whole thing would be much better if you didn't rely on the page
lock at all, but either a) used the same locking as Linux does for its
ptes/tlbs, or b) have some locking that is private to the mmu notifier
code. Then there is not all this new stuff that has to be understood in
the core VM.

Also, why do you have to invalidate ranges when switching to a
_more_ permissive state? This stuff should basically be the same as
(a subset of) the TLB flushing API AFAIKS. Anything more is a pretty
big burden to put in the core VM.

See my alternative patch I posted -- I can't see why it won't work
just like a TLB.

As far as sleeping inside callbacks goes... I think there are big
problems with the patch (the sleeping patch and the external rmap
patch). I don't think it is workable in its current state. Either
we have to make some big changes to the core VM, or we have to turn
some locks into sleeping locks to do it properly AFAIKS. Neither
one is good.

But anyway, I don't really think the two approaches (Andrea's
notifiers vs sleeping/xrmap) should be tangled up too much. I
think Andrea's can possibly be quite unintrusive and useful very
soon.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin

On Tuesday 19 February 2008 20:57, Andi Kleen wrote:
 On Tue, Feb 19, 2008 at 08:46:46PM +1100, Nick Piggin wrote:

  I think it was just a simple context switch benchmark, but not lmbench
  (which I found to be a bit too variable). But it was a long time ago...

 Do you still have it?

 I thought about writing my own but ended up being too lazy for that @)

Had a quick look but couldn't find it. It was just two threads running
and switching to each other with a couple of mutexes or yield. If I
find it, then I'll send it over.


Actually one thing I don't like about gcc is that I think it still
emits cmovs for likely/unlikely branches,
  
   That's -Os.
 
  And -O2 and -O3, on the gccs that I'm using, AFAIKS.

 Well if it still happens on gcc 4.2 with P4 tuning you should
 perhaps open a gcc PR. They tend to ignore these bugs mostly in
 my experience, but sometimes they act on them.

I'm not sure about P4 tuning... But even IMO it should not on
predictable branches too much for any (especially OOOE) CPU.


which is silly (the gcc developers
  
   It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
   makes sense. P4 doesn't like it though.
 
  If the branch is completely predictable (eg. annotated), then I
  think branches should be used anyway. Even on well predicted
  branches, cmov is similar speed on microbenchmarks, but it will
  increase data hazards I think, so it will probably be worse for
  some real world situations.

 At least the respective optimization manuals say they should be used.
 I presume they only made this recommendation after some extensive
 benchmarking.

What I have seen is that they tell you definitely not to use it for
predictable branches. Eg. the Intel optimization manual says

 Use the setcc and cmov instructions to eliminate unpredictable
 conditional branches where possible. Do not do this for predictable
 branches. Do not use these instructions to eliminate all
 unpredictable conditional branches, because using these instructions
 will incur execution overhead due to executing both paths of a
 conditional branch. In addition, converting conditional branches to
 cmovs or setcc trades control-flow dependence for data dependence
 and restricts the capability of the out-of-order engine.


  But a likely branch will be _strongly_ predicted to be taken,
  wheras a lot of the gcc heuristics simply have slightly more or
  slightly less probability. So it's not just a question of which
  way is more likely, but also _how_ likely it is to go that way.

 Yes, but a lot of the heuristics are pretty strong (80%) and gcc will
 act on them unless it has a very strong contra cue. And that should
 normally not be the case.

True, but if you know a branch is 99%+, then use of likely/unlikely
can still be a good idea. 80% may not be enough to choose a branch
over a cmov for example.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
 On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
  are rather similar. However I have tried to make a point of minimising the
  impact the the core mm/. I don't see why we need to invalidate or flush
 
 I also tried hard to minimise the impact of the core mm/, I also
 argued with Christoph that cluttering mm/ wasn't a good idea for
 things like age_page that could be a 1 liner change instead of a
 multiple-liner change, without any loss of flexibility or readability.
 
  anything when changing the pte to be _more_ permissive, and I don't
 
 Note that in my patch the invalidate_pages in mprotect can be
 trivially switched to a mprotect_pages with proper params. This will
 prevent page faults completely in the secondary MMU (there will only
 be tlb misses after the tlb flush just like for the core linux pte),
 and it'll allow all the secondary MMU pte blocks (512/1024 at time
 with my PT lock design) to be updated to have proper permissions
 matching the core linux pte.
 
  understand the need for invalidate_begin/invalidate_end pairs at all.
 
 The need of the pairs is crystal clear to me: range_begin is needed
 for GRU _but_only_if_ range_end is called after releasing the
 reference that the VM holds on the page. _begin will flush the GRU tlb
 and at the same time it will take a mutex that will block further GRU
 tlb-miss-interrupts (no idea how they manange those nightmare locking,
 I didn't even try to add more locking to KVM and I get away with the
 fact KVM takes the pin on the page itself).
 
 My patch calls invalidate_page/pages before the reference is released
 on the page, so GRU will work fine despite lack of
 range_begin. Furthermore with my patch GRU will be auto-serialized by
 the PT lock w/o the need of any additional locking.

That's why I don't understand the need for the pairs: it should be
done like this.


  What I have done is basically create it so that the notifiers get called
  basically in the same place as the normal TLB flushing is done, and nowhere
  else.
 
 That was one of my objectives too.
 
  I also wanted to avoid calling notifier code from inside eg. hardware TLB
  or pte manipulation primitives. These things are already pretty well
  spaghetti, so I'd like to just place them right where needed first... I
  think eventually it will need a bit of a rethink to make it more consistent
  and more general. But I prefer to do put them in the caller for the moment.
 
 Your patch should also work for KVM but it's suboptimal, my patch can
 be orders of magnitude more efficient for GRU thanks to the
 invalidate_pages optimization. Christoph complained about having to
 call one method per pte.

OK, I didn't see the invalidate_pages call...

 
 And adding invalidate_range is useless unless you fully support
 xpmem. You're calling invalidate_range in places that can't sleep...

I thought that could be used by a non-sleeping user (not intending
to try supporting sleeping users). If it is useless then it should
go away (BTW. I didn't see your recent patch, some of my confusion
I think stems from Christoph's novel way of merging and splitting
patches).


 No idea why xpmem needs range_begin, I perfectly understand why GRU
 needs _begin with Chrisotph's patch (gru lacks the page pin) but I
 dunno why xpmem needs range_begin (xpmem has the page pin so I also
 think it could avoid using range_begin). Still to support GRU you need
 both to call invalidate_range in places that can sleep and you need
 the external rmap notifier. The moment you add xpmem into the equation
 your and my clean patches become Christoph's one...

Sorry, I kind of didn't have time to follow the conversation so well
before; are there patches posted for gru and/or xpmem?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote:
  On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
   understand the need for invalidate_begin/invalidate_end pairs at all.
  
  The need of the pairs is crystal clear to me: range_begin is needed
  for GRU _but_only_if_ range_end is called after releasing the
  reference that the VM holds on the page. _begin will flush the GRU tlb
  and at the same time it will take a mutex that will block further GRU
  tlb-miss-interrupts (no idea how they manange those nightmare locking,
  I didn't even try to add more locking to KVM and I get away with the
  fact KVM takes the pin on the page itself).
 
 As it turns out, no actual mutex is required. _begin_ simply increments a
 count of active range invalidates, _end_ decrements the count. New TLB
 dropins are deferred while range callouts are active.
 
 This would appear to be racy but the GRU has special hardware that
 simplifies locking. When the GRU sees a TLB invalidate, all outstanding
 misses  potentially inflight TLB dropins are marked by the GRU with a
 kill bit. When the dropin finally occurs, the dropin is ignored  the
 instruction is simply restarted. The instruction will fault again  the TLB
 dropin will be repeated.  This is optimized for the case where invalidates
 are rare - true for users of the GRU.

OK (thanks to Robin as well). Now I understand why you are using it,
but I don't understand why you don't defer new TLBs after the point
where the linux pte changes. If you can do that, then you look and
act much more like a TLB from the point of view of the Linux vm.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
 The invalidation of address ranges in a mm_struct needs to be
 performed when pages are removed or permissions etc change.

 If invalidate_range_begin() is called with locks held then we
 pass a flag into invalidate_range() to indicate that no sleeping is
 possible. Locks are only held for truncate and huge pages.

You can't sleep inside rcu_read_lock()!

I must say that for a patch that is up to v8 or whatever and is
posted twice a week to such a big cc list, it is kind of slack to
not even test it and expect other people to review it.

Also, what we are going to need here are not skeleton drivers
that just do all the *easy* bits (of registering their callbacks),
but actual fully working examples that do everything that any
real driver will need to do. If not for the sanity of the driver
writer, then for the sanity of the VM developers (I don't want
to have to understand xpmem or infiniband in order to understand
how the VM works).



 In two cases we use invalidate_range_begin/end to invalidate
 single pages because the pair allows holding off new references
 (idea by Robin Holt).

 do_wp_page(): We hold off new references while we update the pte.

 xip_unmap: We are not taking the PageLock so we cannot
 use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
 stands in.

 Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
 Signed-off-by: Robin Holt [EMAIL PROTECTED]
 Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

 ---
  mm/filemap_xip.c |5 +
  mm/fremap.c  |3 +++
  mm/hugetlb.c |3 +++
  mm/memory.c  |   35 +--
  mm/mmap.c|2 ++
  mm/mprotect.c|3 +++
  mm/mremap.c  |7 ++-
  7 files changed, 51 insertions(+), 7 deletions(-)

 Index: linux-2.6/mm/fremap.c
 ===
 --- linux-2.6.orig/mm/fremap.c2008-02-14 18:43:31.0 -0800
 +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.0 -0800
 @@ -15,6 +15,7 @@
  #include linux/rmap.h
  #include linux/module.h
  #include linux/syscalls.h
 +#include linux/mmu_notifier.h

  #include asm/mmu_context.h
  #include asm/cacheflush.h
 @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
   spin_unlock(mapping-i_mmap_lock);
   }

 + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
   err = populate_range(mm, vma, start, size, pgoff);
 + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
   if (!err  !(flags  MAP_NONBLOCK)) {
   if (unlikely(has_write_lock)) {
   downgrade_write(mm-mmap_sem);
 Index: linux-2.6/mm/memory.c
 ===
 --- linux-2.6.orig/mm/memory.c2008-02-14 18:43:31.0 -0800
 +++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.0 -0800
 @@ -51,6 +51,7 @@
  #include linux/init.h
  #include linux/writeback.h
  #include linux/memcontrol.h
 +#include linux/mmu_notifier.h

  #include asm/pgalloc.h
  #include asm/uaccess.h
 @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
   if (is_vm_hugetlb_page(vma))
   return copy_hugetlb_page_range(dst_mm, src_mm, vma);

 + if (is_cow_mapping(vma-vm_flags))
 + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
 +
   dst_pgd = pgd_offset(dst_mm, addr);
   src_pgd = pgd_offset(src_mm, addr);
   do {
 @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
   vma, addr, next))
   return -ENOMEM;
   } while (dst_pgd++, src_pgd++, addr = next, addr != end);
 +
 + if (is_cow_mapping(vma-vm_flags))
 + mmu_notifier(invalidate_range_end, src_mm,
 + vma-vm_start, end, 0);
 +
   return 0;
  }

 @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
   struct mmu_gather *tlb;
   unsigned long end = address + size;
   unsigned long nr_accounted = 0;
 + int atomic = details ? (details-i_mmap_lock != 0) : 0;

   lru_add_drain();
   tlb = tlb_gather_mmu(mm, 0);
   update_hiwater_rss(mm);
 + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
   end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
   if (tlb)
   tlb_finish_mmu(tlb, address, end);
 + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
   return end;
  }


Where do you invalidate for munmap()?

Also, how to you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at

Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin

On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
 On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
  anything when changing the pte to be _more_ permissive, and I don't
 
 Note that in my patch the invalidate_pages in mprotect can be
 trivially switched to a mprotect_pages with proper params. This will
 prevent page faults completely in the secondary MMU (there will only
 be tlb misses after the tlb flush just like for the core linux pte),
 and it'll allow all the secondary MMU pte blocks (512/1024 at time
 with my PT lock design) to be updated to have proper permissions
 matching the core linux pte.

Sorry, I realise I still didn't get this through my head yet (and also
have not seen your patch recently). So I don't know exactly what you
are doing...

But why does _anybody_ (why does Christoph's patches) need to invalidate
when they are going to be more permissive? This should be done lazily by
the driver, I would have thought.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin

On Friday 15 February 2008 17:49, Christoph Lameter wrote:
 These special additional callbacks are required because XPmem (and likely
 other mechanisms) do use their own rmap (multiple processes on a series
 of remote Linux instances may be accessing the memory of a process).
 F.e. XPmem may have to send out notifications to remote Linux instances
 and receive confirmation before a page can be freed.

 So we handle this like an additional Linux reverse map that is walked after
 the existing rmaps have been walked. We leave the walking to the driver
 that is then able to use something else than a spinlock to walk its reverse
 maps. So we can actually call the driver without holding spinlocks while we
 hold the Pagelock.

I don't know how this is supposed to solve anything. The sleeping
problem happens I guess mostly in truncate. And all you are doing
is putting these rmap callbacks in page_mkclean and try_to_unmap.


 However, we cannot determine the mm_struct that a page belongs to at
 that point. The mm_struct can only be determined from the rmaps by the
 device driver.

 We add another pageflag (PageExternalRmap) that is set if a page has
 been remotely mapped (f.e. by a process from another Linux instance).
 We can then only perform the callbacks for pages that are actually in
 remote use.

 Rmap notifiers need an extra page bit and are only available
 on 64 bit platforms. This functionality is not available on 32 bit!

 A notifier that uses the reverse maps callbacks does not need to provide
 the invalidate_page() method that is called when locks are held.

That doesn't seem right. To start with, the new callbacks aren't
even called in the places where invalidate_page isn't allowed to
sleep.

The problem is unmap_mapping_range, right? And unmap_mapping_range
must walk the rmaps with the mmap lock held, which is why it can't
sleep. And it can't hold any mmap_sem so it cannot prevent address
space modifications of the processes in question between the time
you unmap them from the linux ptes with unmap_mapping_range, and the
time that you unmap them from your driver.

So in the meantime, you could have eg. a fault come in and set up a
new page for one of the processes, and that page might even get
exported via the same external driver. And now you have a totally
inconsistent view.

Preventing new mappings from being set up until the old mapping is
completely flushed is basically what we need to ensure for any sane
TLB as far as I can tell. To do that, you'll need to make the mmap
lock sleep, and either take mmap_sem inside it (which is a
deadlock condition at the moment), or make ptl sleep as well. These
are simply the locks we use to prevent that from happening, so I
can't see how you can possibly hope to have a coherent TLB without
invalidating inside those locks.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin

On Wednesday 20 February 2008 14:00, Robin Holt wrote:
 On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
  On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:

   Also, how to you resolve the case where you are not allowed to sleep?
   I would have thought either you have to handle it, in which case nobody
   needs to sleep; or you can't handle it, in which case the code is
   broken.
 
  I also asked exactly this, glad you reasked this too.

 Currently, we BUG_ON having a PFN in our tables and not being able
 to sleep.  These are mappings which MPT has never supported in the past
 and XPMEM was already not allowing page faults for VMAs which are not
 anonymous so it should never happen.  If the file-backed operations can
 ever get changed to allow for sleeping and a customer has a need for it,
 we would need to change XPMEM to allow those types of faults to succeed.

Do you really want to be able to swap, or are you just interested
in keeping track of unmaps / prot changes?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin

On Wednesday 20 February 2008 14:12, Robin Holt wrote:
 For XPMEM, we do not currently allow file backed
 mapping pages from being exported so we should never reach this condition.
 It has been an issue since day 1.  We have operated with that assumption
 for 6 years and have not had issues with that assumption.  The user of
 xpmem is MPT and it controls the communication buffers so it is reasonable
 to expect this type of behavior.

OK, that makes things simpler.

So why can't you export a device from your xpmem driver, which
can be mmap()ed to give out anonymous memory pages to be used
for these communication buffers?

I guess you may also want an munmap/mprotect callback, which
we don't have in the kernel right now... but at least you could
prototype it easily by having an ioctl to be called before
munmapping or mprotecting (eg. the ioctl could prevent new TLB
setup for the region, and shoot down existing ones).

This is actually going to be much faster for you if you use any
threaded applications, because you will be able to do all the
shootdown round trips outside mmap_sem, and so you will be able
to have other threads faulting and even mmap()ing / munmaping
at the same time as the shootdown is happening.

I guess there is some catch...

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH, RFC] kthread: (possibly) a missing memory barrier in kthread_stop()

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 10:03, Dmitry Adamushko wrote:
> Hi,
>
>
> [ description ]
>
> Subject: kthread: add a memory barrier to kthread_stop()
>
> 'kthread' threads do a check in the following order:
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> and set_current_state() implies an smp_mb().
>
> on another side (kthread_stop), wake_up_process() does not seem to
> guarantee a full mb.
>
> And 'kthread_stop_info.k' must be visible before wake_up_process()
> checks for/modifies a state of the 'kthread' task.
>
> (the patch is at the end of the message)
>
>
> [ more detailed description ]
>
> the current code might well be safe in case a to-be-stopped 'kthread'
> task is _not_ running on another CPU at the moment when kthread_stop()
> is called (in this case, 'rq->lock' will act as a kind of synch.
> point/barrier).
>
> Another case is as follows:
>
> CPU#0:
>
> ...
> while (kthread_should_stop()) {
>
>if (condition)
>  schedule();
>
>/* ... do something useful ... */   <--- EIP
>
>set_current_state(TASK_INTERRUPTIBLE);
> }
>
> so a 'kthread' task is about to call
> set_current_state(TASK_INTERRUPTIBLE) ...
>
>
> (in the mean time)
>
> CPU#1:
>
> kthread_stop()
>
> -> kthread_stop_info.k = k (*)
> -> wake_up_process()
>
> wake_up_process() looks like:
>
> (try_to_wake_up)
>
> IRQ_OFF
> LOCK
>
> old_state = p->state;
> if (!(old_state & state))  (**)
>  goto out;
>
> ...
>
> UNLOCK
> IRQ_ON
>
>
> let's suppose (*) and (**) are reordered
> (according to Documentation/memory-barriers.txt, neither IRQ_OFF nor
> LOCK may prevent it from happening).
>
> - the state is TASK_RUNNING, so we are about to return.
>
> - CPU#1 is about to execute (*) (it's guaranteed to be done before
> spin_unlock(>lock) at the end of try_to_wake_up())
>
>
> (in the mean time)
>
> CPU#0:
>
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> here, kthread_stop_info.k is not yet visible
>
> - schedule()
>
> ...
>
> we missed a 'kthread_stop' event.
>
> hum?

Looks like you are correct to me.


> TIA,
>
> ---
>
> From: Dmitry Adamushko <[EMAIL PROTECTED]>
> Subject: kthread: add a memory barrier to kthread_stop()
>
> 'kthread' threads do a check in the following order:
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> and set_current_state() implies an smp_mb().
>
> on another side (kthread_stop), wake_up_process() is not guaranteed to
> act as a full mb.
>
> 'kthread_stop_info.k' must be visible before wake_up_process() checks
> for/modifies a state of the 'kthread' task.
>
>
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>
>
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 0ac8878..5167110 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -211,6 +211,10 @@ int kthread_stop(struct task_struct *k)
>
>   /* Now set kthread_should_stop() to true, and wake it up. */
>   kthread_stop_info.k = k;
> +
> + /* The previous store operation must not get ahead of the wakeup. */
> + smp_mb();
> +
>   wake_up_process(k);
>   put_task_struct(k);
>
>
>
> --

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 16:58, Willy Tarreau wrote:
> On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:
> > > Note in particular the last predictors; assuming branch ending
> > > with goto, including call, causing early function return or
> > > returning negative constant are not taken. Just these alone
> > > are likely 95+% of the unlikelies in the kernel.
> >
> > Yes, gcc should be able to do pretty good heuristics, considering
> > the quite good numbers that cold CPU predictors can attain. However
> > for really performance critical code (or really "never" executed
> > code), then I think it is OK to have the hints and not have to rely
> > on gcc heuristics.
>
> in my experience, the real problem is that gcc does what *it* wants and not
> what *you* want. I've been annoyed a lot by the way it coded some loops
> that could really be blazingly fast, but which resulted in a ton of
> branches due to its predictors. And using unlikely() there was a real mess,
> because instead of just hinting the compiler with probabilities to write
> some linear code for the *most* common case, it ended up with awful
> branches everywhere with code sent far away and even duplicated for some
> branches.
>
> Sometimes, for performance critical paths, I would like gcc to be dumb and
> follow *my* code and not its hard-coded probabilities. For instance, in a
> tree traversal, you really know how you want to build your loop. And these
> days, it seems like the single method of getting it your way is doing asm,
> which obviously is not portable :-(

Probably all true.


> Maybe one thing we would need would be the ability to assign probabilities
> to each branch based on what we expect, so that gcc could build a better
> tree keeping most frequently used code tight.

I don't know if that would *directly* lead to gcc being smarter. I
think perhaps they probably don't benchmark on code bases that have
much explicit annotation (I'm sure they wouldn't seriously benchmark
any parts of Linux as part of daily development). I think the key is
to continue to use annotations _properly_, and eventually gcc should
go in the right direction if enough code uses it.

And if you have really good examples like it sounds like above, then
I guess that should be reported to gcc?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 13:40, Arjan van de Ven wrote:
> On Tue, 19 Feb 2008 13:33:53 +1100
>
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Actually one thing I don't like about gcc is that I think it still
> > emits cmovs for likely/unlikely branches, which is silly (the gcc
> > developers seem to be in love with that instruction). If that goes
> > away, then branch hints may be even better.
>
> only for -Os and only if the result is smaller afaik.

What is your evidence for saying this? Because here, with the latest
kernel and recent gcc-4.3 snapshot, it spits out cmov like crazy even
when compiled with -O2.

[EMAIL PROTECTED]:~/usr/src/linux-2.6$ grep cmov kernel/sched.s | wc -l
45

And yes it even does for hinted branches and even at -O2/3

[EMAIL PROTECTED]:~/tests$ cat cmov.c
int test(int a, int b)
{
if (__builtin_expect(a < b, 0))
return a;
else
return b;
}
[EMAIL PROTECTED]:~/tests$ gcc-4.3 -S -O2 cmov.c
[EMAIL PROTECTED]:~/tests$ head -13 cmov.s
.file   "cmov.c"
.text
.p2align 4,,15
..globl test
.type   test, @function
test:
..LFB2:
cmpl%edi, %esi
cmovle  %esi, %edi
movl%edi, %eax
ret
..LFE2:
.size   test, .-test

This definitely should be a branch, IMO.

> (cmov tends to be a performance loss most of the time so for -O2 and such
> it isn't used as far as I know.. it does make for nice small code however
> ;-)

It shouldn't be hard to work out the cutover point based on how
expensive cmov is, how expensive branch and branch mispredicts are,
and how often the branch is likely to be mispredicted. For an
unpredictable branch, cmov is normally quite a good win even on
modern CPUs. But gcc overuses it I think.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-18 Thread Nick Piggin

On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:

> > > [ 4057.31] Pid: 7035, comm: ftest03 Not tainted
> > > 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #2 [ 4057.31] RIP:
> > > 0010:[]  []
> > > iov_iter_advance+0x38/0x70 [ 4057.31] RSP: 0018:810110329b20 
> > > EFLAGS: 00010246
> > > [ 4057.31] RAX:  RBX: 0800 RCX:
> > >  [ 4057.31] RDX:  RSI:
> > > 0800 RDI: 810110329ba8 [ 4057.31] RBP:
> > > 0800 R08:  R09: 810101dbc000 [
> > > 4057.31] R10: 0004 R11:  R12:
> > > 00026000 [ 4057.31] R13: 81010d765c98 R14:
> > > 1000 R15:  [ 4057.31] FS: 
> > > 7fee589146d0() GS:80501000() knlGS:
> > > [ 4057.31] CS:  0010 DS:  ES:  CR0: 8005003b [
> > > 4057.31] CR2: 810101dbc008 CR3: 0001103da000 CR4:
> > > 06e0 [ 4057.31] DR0:  DR1:
> > >  DR2:  [ 4057.31] DR3:
> > >  DR6: 0ff0 DR7: 0400 [
> > > 4057.31] Process ftest03 (pid: 7035, threadinfo 810110328000,
> > > task 810160b0) [ 4057.31] Stack:  8025b413
> > > 81010d765ab0 804e6fd8 001201d2 [ 4057.31] 
> > > 810110329db8 00026000 810110329d38 81017b9fb500 [
> > > 4057.31]  81010d765c98 804175e0 81010d765ab0
> > >  [ 4057.31] Call Trace:
> > > [ 4057.31]  [] ?
> > > generic_file_buffered_write+0x1e3/0x6f0 [ 4057.31] 
> > > [] ? current_fs_time+0x1e/0x30 [ 4057.31] 
> > > [] ? __generic_file_aio_write_nolock+0x28f/0x440 [
> > > 4057.31]  [] ? generic_file_aio_write+0x63/0xd0 [
> > > 4057.31]  [] ? ext3_file_write+0x23/0xc0 [
> > > 4057.31]  [] ? ext3_file_write+0x0/0xc0 [
> > > 4057.31]  [] ? do_sync_readv_writev+0xcb/0x110 [
> > > 4057.31]  [] ? autoremove_wake_function+0x0/0x30
> > > [ 4057.31]  [] ?
> > > debug_check_no_locks_freed+0x7d/0x130 [ 4057.31] 
> > > [] ? trace_hardirqs_on+0xcf/0x150 [ 4057.31] 
> > > [] ? __kmalloc+0x15/0xc0
> > > [ 4057.31]  [] ? rw_copy_check_uvector+0x9d/0x130
> > > [ 4057.31]  [] ? do_readv_writev+0xe0/0x170
> > > [ 4057.31]  [] ? mutex_lock_nested+0x1a7/0x280
> > > [ 4057.31]  [] ? trace_hardirqs_on+0xcf/0x150
> > > [ 4057.31]  [] ?
> > > __mutex_unlock_slowpath+0xc9/0x170 [ 4057.31]  []
> > > ? trace_hardirqs_on+0xcf/0x150 [ 4057.31]  [] ?
> > > trace_hardirqs_on_thunk+0x35/0x3a [ 4057.31]  []
> > > ? sys_writev+0x53/0x90
> > > [ 4057.31]  [] ?
> > > system_call_after_swapgs+0x7b/0x80 [ 4057.31]
> > > [ 4057.31]
> > > [ 4057.31] Code: 48 01 77 10 48 29 77 18 c3 0f 0b eb fe 66 66 90 66
> > > 66 90 4c 8b 0f 48 8b 4f 10 49 89 f0 eb 07 66 66 66 90 49 29 c0 4d 85 c0
> > > 75 07 <49> 83 79 08 00 75 23 49 8b 51 08 48 89 d0 48 29 c8 49 39 c0 49
> > > [ 4057.31] RIP  [] iov_iter_advance+0x38/0x70 [
> > > 4057.31]  RSP 
> > > [ 4057.31] CR2: 810101dbc008
> > > [ 4057.31] Kernel panic - not syncing: Fatal exception

Can you try this patch please?
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1753,9 +1753,10 @@ static void __iov_iter_advance_iov(struc
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments. But we mustn't try to "skip" if
+		 * we have come to the end (i->count == bytes).
 		 */
-		while (bytes || !iov->iov_len) {
+		while (bytes || (unlikely(!iov->iov_len) && i->count > bytes)) {
 			int copy = min(bytes, iov->iov_len - base);
 
 			bytes -= copy;

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 01:39, Andi Kleen wrote:
> Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > you have more faith in the authors knowledge of how his code actually
> > behaves than I think is warranted  :)
>
> iirc there was a mm patch some time ago to keep track of the actual
> unlikely values at runtime and it showed indeed some wrong ones. But the
> far majority of them are probably correct.
>
> > Or faith in that he knows what "unlikely" means.
> > I should write docs about this; but unlikely() means:
> > 1) It happens less than 0.01% of the cases.
> > 2) The compiler couldn't have figured this out by itself
> >(NULL pointer checks are compiler done already, same for some other
> > conditions) 3) It's a hot codepath where shaving 0.5 cycles (less even on
> > x86) matters (and the author is ok with taking a 500 cycles hit if he's
> > wrong)
>
> One more thing unlikely() does is to move the unlikely code out of line.
> So it should conserve some icache in critical functions, which might
> well be worth some more cycles (don't have numbers though).

I actually once measured context switching performance in the scheduler,
and removing the  unlikely hint for testing RT tasks IIRC gave about 5%
performance drop.

This was on a P4 which is very different from more modern CPUs both in
terms of branch performance characteristics, and icache characteristics.
However, the P4's branch predictor is pretty good, and it should easily
be able to correctly predict the rt_task check if it has enough entries.
So I think much of the savings came from code transformation and movement.
Anyway, it is definitely worthwhile if used correctly.

Actually one thing I don't like about gcc is that I think it still emits
cmovs for likely/unlikely branches, which is silly (the gcc developers
seem to be in love with that instruction). If that goes away, then
branch hints may be even better.

>
> But overall I agree with you that unlikely is in most cases a bad
> idea (and I submitted the original patch introducing it originally @). That
> is because it is often used in situations where gcc's default branch
> prediction heuristics do would make exactly the same decision
>
>if (unlikely(x == NULL))
>
> is simply totally useless because gcc already assumes all x == NULL
> tests are unlikely. I appended some of the builtin heuristics from
> a recent gcc source so people can see them.
>
> Note in particular the last predictors; assuming branch ending
> with goto, including call, causing early function return or
> returning negative constant are not taken. Just these alone
> are likely 95+% of the unlikelies in the kernel.

Yes, gcc should be able to do pretty good heuristics, considering
the quite good numbers that cold CPU predictors can attain. However
for really performance critical code (or really "never" executed
code), then I think it is OK to have the hints and not have to rely
on gcc heuristics.

>
> -Andi

[snip]

Interesting, thanks!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: IO queueing and complete affinity w/ threads: Some results

2008-02-18 Thread Nick Piggin

On Mon, Feb 18, 2008 at 02:33:17PM +0100, Andi Kleen wrote:
> Jens Axboe <[EMAIL PROTECTED]> writes:
> 
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
> 
> I actually liked Nick's queued smp_function_call_single() patch. So even
> if it was not used for block I would still like to see it being merged 
> in some form to speed up all the other IPI users.

Yeah, that hasn't been forgotten (nor have your comments about folding
my special function into smp_call_function_single).

The call function path is terribly unscalable at the moment on a lot
of architectures, and also it isn't allowed to be used with interrupts
off due to deadlock (which the queued version can allow, provided
that wait=0).

I will get around to sending that upstream soon.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: IO queueing and complete affinity w/ threads: Some results

2008-02-18 Thread Nick Piggin

On Mon, Feb 18, 2008 at 02:33:17PM +0100, Andi Kleen wrote:
 Jens Axboe [EMAIL PROTECTED] writes:
 
  and that scrapping the remote
  softirq trigger stuff is sanest.
 
 I actually liked Nick's queued smp_function_call_single() patch. So even
 if it was not used for block I would still like to see it being merged 
 in some form to speed up all the other IPI users.

Yeah, that hasn't been forgotten (nor have your comments about folding
my special function into smp_call_function_single).

The call function path is terribly unscalable at the moment on a lot
of architectures, and also it isn't allowed to be used with interrupts
off due to deadlock (which the queued version can allow, provided
that wait=0).

I will get around to sending that upstream soon.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 01:39, Andi Kleen wrote:
 Arjan van de Ven [EMAIL PROTECTED] writes:
  you have more faith in the authors knowledge of how his code actually
  behaves than I think is warranted  :)

 iirc there was a mm patch some time ago to keep track of the actual
 unlikely values at runtime and it showed indeed some wrong ones. But the
 far majority of them are probably correct.

  Or faith in that he knows what unlikely means.
  I should write docs about this; but unlikely() means:
  1) It happens less than 0.01% of the cases.
  2) The compiler couldn't have figured this out by itself
 (NULL pointer checks are compiler done already, same for some other
  conditions) 3) It's a hot codepath where shaving 0.5 cycles (less even on
  x86) matters (and the author is ok with taking a 500 cycles hit if he's
  wrong)

 One more thing unlikely() does is to move the unlikely code out of line.
 So it should conserve some icache in critical functions, which might
 well be worth some more cycles (don't have numbers though).

I actually once measured context switching performance in the scheduler,
and removing the  unlikely hint for testing RT tasks IIRC gave about 5%
performance drop.

This was on a P4 which is very different from more modern CPUs both in
terms of branch performance characteristics, and icache characteristics.
However, the P4's branch predictor is pretty good, and it should easily
be able to correctly predict the rt_task check if it has enough entries.
So I think much of the savings came from code transformation and movement.
Anyway, it is definitely worthwhile if used correctly.

Actually one thing I don't like about gcc is that I think it still emits
cmovs for likely/unlikely branches, which is silly (the gcc developers
seem to be in love with that instruction). If that goes away, then
branch hints may be even better.


 But overall I agree with you that unlikely is in most cases a bad
 idea (and I submitted the original patch introducing it originally @). That
 is because it is often used in situations where gcc's default branch
 prediction heuristics do would make exactly the same decision

if (unlikely(x == NULL))

 is simply totally useless because gcc already assumes all x == NULL
 tests are unlikely. I appended some of the builtin heuristics from
 a recent gcc source so people can see them.

 Note in particular the last predictors; assuming branch ending
 with goto, including call, causing early function return or
 returning negative constant are not taken. Just these alone
 are likely 95+% of the unlikelies in the kernel.

Yes, gcc should be able to do pretty good heuristics, considering
the quite good numbers that cold CPU predictors can attain. However
for really performance critical code (or really never executed
code), then I think it is OK to have the hints and not have to rely
on gcc heuristics.


 -Andi

[snip]

Interesting, thanks!

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-sha1: RIP [ffffffff802596c8] iov_iter_advance+0x38/0x70

2008-02-18 Thread Nick Piggin

On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
 On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:

   [ 4057.31] Pid: 7035, comm: ftest03 Not tainted
   2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #2 [ 4057.31] RIP:
   0010:[802596c8]  [802596c8]
   iov_iter_advance+0x38/0x70 [ 4057.31] RSP: 0018:810110329b20 
   EFLAGS: 00010246
   [ 4057.31] RAX:  RBX: 0800 RCX:
    [ 4057.31] RDX:  RSI:
   0800 RDI: 810110329ba8 [ 4057.31] RBP:
   0800 R08:  R09: 810101dbc000 [
   4057.31] R10: 0004 R11:  R12:
   00026000 [ 4057.31] R13: 81010d765c98 R14:
   1000 R15:  [ 4057.31] FS: 
   7fee589146d0() GS:80501000() knlGS:
   [ 4057.31] CS:  0010 DS:  ES:  CR0: 8005003b [
   4057.31] CR2: 810101dbc008 CR3: 0001103da000 CR4:
   06e0 [ 4057.31] DR0:  DR1:
    DR2:  [ 4057.31] DR3:
    DR6: 0ff0 DR7: 0400 [
   4057.31] Process ftest03 (pid: 7035, threadinfo 810110328000,
   task 810160b0) [ 4057.31] Stack:  8025b413
   81010d765ab0 804e6fd8 001201d2 [ 4057.31] 
   810110329db8 00026000 810110329d38 81017b9fb500 [
   4057.31]  81010d765c98 804175e0 81010d765ab0
    [ 4057.31] Call Trace:
   [ 4057.31]  [8025b413] ?
   generic_file_buffered_write+0x1e3/0x6f0 [ 4057.31] 
   [8022f4ae] ? current_fs_time+0x1e/0x30 [ 4057.31] 
   [8025bbaf] ? __generic_file_aio_write_nolock+0x28f/0x440 [
   4057.31]  [8025bdc3] ? generic_file_aio_write+0x63/0xd0 [
   4057.31]  [802bfbc3] ? ext3_file_write+0x23/0xc0 [
   4057.31]  [802bfba0] ? ext3_file_write+0x0/0xc0 [
   4057.31]  [8027ebeb] ? do_sync_readv_writev+0xcb/0x110 [
   4057.31]  [8023f2b0] ? autoremove_wake_function+0x0/0x30
   [ 4057.31]  [8024be8d] ?
   debug_check_no_locks_freed+0x7d/0x130 [ 4057.31] 
   [8024bd8f] ? trace_hardirqs_on+0xcf/0x150 [ 4057.31] 
   [8027c265] ? __kmalloc+0x15/0xc0
   [ 4057.31]  [8027ea4d] ? rw_copy_check_uvector+0x9d/0x130
   [ 4057.31]  [8027f330] ? do_readv_writev+0xe0/0x170
   [ 4057.31]  [803fb5e7] ? mutex_lock_nested+0x1a7/0x280
   [ 4057.31]  [8024bd8f] ? trace_hardirqs_on+0xcf/0x150
   [ 4057.31]  [803fb389] ?
   __mutex_unlock_slowpath+0xc9/0x170 [ 4057.31]  [8024bd8f]
   ? trace_hardirqs_on+0xcf/0x150 [ 4057.31]  [803fc6ba] ?
   trace_hardirqs_on_thunk+0x35/0x3a [ 4057.31]  [8027f7c3]
   ? sys_writev+0x53/0x90
   [ 4057.31]  [8020b39b] ?
   system_call_after_swapgs+0x7b/0x80 [ 4057.31]
   [ 4057.31]
   [ 4057.31] Code: 48 01 77 10 48 29 77 18 c3 0f 0b eb fe 66 66 90 66
   66 90 4c 8b 0f 48 8b 4f 10 49 89 f0 eb 07 66 66 66 90 49 29 c0 4d 85 c0
   75 07 49 83 79 08 00 75 23 49 8b 51 08 48 89 d0 48 29 c8 49 39 c0 49
   [ 4057.31] RIP  [802596c8] iov_iter_advance+0x38/0x70 [
   4057.31]  RSP 810110329b20
   [ 4057.31] CR2: 810101dbc008
   [ 4057.31] Kernel panic - not syncing: Fatal exception

Can you try this patch please?
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1753,9 +1753,10 @@ static void __iov_iter_advance_iov(struc
 
 		/*
 		 * The !iov-iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments. But we mustn't try to skip if
+		 * we have come to the end (i-count == bytes).
 		 */
-		while (bytes || !iov-iov_len) {
+		while (bytes || (unlikely(!iov-iov_len)  i-count  bytes)) {
 			int copy = min(bytes, iov-iov_len - base);
 
 			bytes -= copy;

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 13:40, Arjan van de Ven wrote:
 On Tue, 19 Feb 2008 13:33:53 +1100

 Nick Piggin [EMAIL PROTECTED] wrote:
  Actually one thing I don't like about gcc is that I think it still
  emits cmovs for likely/unlikely branches, which is silly (the gcc
  developers seem to be in love with that instruction). If that goes
  away, then branch hints may be even better.

 only for -Os and only if the result is smaller afaik.

What is your evidence for saying this? Because here, with the latest
kernel and recent gcc-4.3 snapshot, it spits out cmov like crazy even
when compiled with -O2.

[EMAIL PROTECTED]:~/usr/src/linux-2.6$ grep cmov kernel/sched.s | wc -l
45

And yes it even does for hinted branches and even at -O2/3

[EMAIL PROTECTED]:~/tests$ cat cmov.c
int test(int a, int b)
{
if (__builtin_expect(a  b, 0))
return a;
else
return b;
}
[EMAIL PROTECTED]:~/tests$ gcc-4.3 -S -O2 cmov.c
[EMAIL PROTECTED]:~/tests$ head -13 cmov.s
.file   cmov.c
.text
.p2align 4,,15
..globl test
.type   test, @function
test:
..LFB2:
cmpl%edi, %esi
cmovle  %esi, %edi
movl%edi, %eax
ret
..LFE2:
.size   test, .-test

This definitely should be a branch, IMO.

 (cmov tends to be a performance loss most of the time so for -O2 and such
 it isn't used as far as I know.. it does make for nice small code however
 ;-)

It shouldn't be hard to work out the cutover point based on how
expensive cmov is, how expensive branch and branch mispredicts are,
and how often the branch is likely to be mispredicted. For an
unpredictable branch, cmov is normally quite a good win even on
modern CPUs. But gcc overuses it I think.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH, RFC] kthread: (possibly) a missing memory barrier in kthread_stop()

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 10:03, Dmitry Adamushko wrote:
 Hi,


 [ description ]

 Subject: kthread: add a memory barrier to kthread_stop()

 'kthread' threads do a check in the following order:
 - set_current_state(TASK_INTERRUPTIBLE);
 - kthread_should_stop();

 and set_current_state() implies an smp_mb().

 on another side (kthread_stop), wake_up_process() does not seem to
 guarantee a full mb.

 And 'kthread_stop_info.k' must be visible before wake_up_process()
 checks for/modifies a state of the 'kthread' task.

 (the patch is at the end of the message)


 [ more detailed description ]

 the current code might well be safe in case a to-be-stopped 'kthread'
 task is _not_ running on another CPU at the moment when kthread_stop()
 is called (in this case, 'rq-lock' will act as a kind of synch.
 point/barrier).

 Another case is as follows:

 CPU#0:

 ...
 while (kthread_should_stop()) {

if (condition)
  schedule();

/* ... do something useful ... */   --- EIP

set_current_state(TASK_INTERRUPTIBLE);
 }

 so a 'kthread' task is about to call
 set_current_state(TASK_INTERRUPTIBLE) ...


 (in the mean time)

 CPU#1:

 kthread_stop()

 - kthread_stop_info.k = k (*)
 - wake_up_process()

 wake_up_process() looks like:

 (try_to_wake_up)

 IRQ_OFF
 LOCK

 old_state = p-state;
 if (!(old_state  state))  (**)
  goto out;

 ...

 UNLOCK
 IRQ_ON


 let's suppose (*) and (**) are reordered
 (according to Documentation/memory-barriers.txt, neither IRQ_OFF nor
 LOCK may prevent it from happening).

 - the state is TASK_RUNNING, so we are about to return.

 - CPU#1 is about to execute (*) (it's guaranteed to be done before
 spin_unlock(rq-lock) at the end of try_to_wake_up())


 (in the mean time)

 CPU#0:

 - set_current_state(TASK_INTERRUPTIBLE);
 - kthread_should_stop();

 here, kthread_stop_info.k is not yet visible

 - schedule()

 ...

 we missed a 'kthread_stop' event.

 hum?

Looks like you are correct to me.


 TIA,

 ---

 From: Dmitry Adamushko [EMAIL PROTECTED]
 Subject: kthread: add a memory barrier to kthread_stop()

 'kthread' threads do a check in the following order:
 - set_current_state(TASK_INTERRUPTIBLE);
 - kthread_should_stop();

 and set_current_state() implies an smp_mb().

 on another side (kthread_stop), wake_up_process() is not guaranteed to
 act as a full mb.

 'kthread_stop_info.k' must be visible before wake_up_process() checks
 for/modifies a state of the 'kthread' task.


 Signed-off-by: Dmitry Adamushko [EMAIL PROTECTED]


 diff --git a/kernel/kthread.c b/kernel/kthread.c
 index 0ac8878..5167110 100644
 --- a/kernel/kthread.c
 +++ b/kernel/kthread.c
 @@ -211,6 +211,10 @@ int kthread_stop(struct task_struct *k)

   /* Now set kthread_should_stop() to true, and wake it up. */
   kthread_stop_info.k = k;
 +
 + /* The previous store operation must not get ahead of the wakeup. */
 + smp_mb();
 +
   wake_up_process(k);
   put_task_struct(k);



 --

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin

On Tuesday 19 February 2008 16:58, Willy Tarreau wrote:
 On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:
   Note in particular the last predictors; assuming branch ending
   with goto, including call, causing early function return or
   returning negative constant are not taken. Just these alone
   are likely 95+% of the unlikelies in the kernel.
 
  Yes, gcc should be able to do pretty good heuristics, considering
  the quite good numbers that cold CPU predictors can attain. However
  for really performance critical code (or really never executed
  code), then I think it is OK to have the hints and not have to rely
  on gcc heuristics.

 in my experience, the real problem is that gcc does what *it* wants and not
 what *you* want. I've been annoyed a lot by the way it coded some loops
 that could really be blazingly fast, but which resulted in a ton of
 branches due to its predictors. And using unlikely() there was a real mess,
 because instead of just hinting the compiler with probabilities to write
 some linear code for the *most* common case, it ended up with awful
 branches everywhere with code sent far away and even duplicated for some
 branches.

 Sometimes, for performance critical paths, I would like gcc to be dumb and
 follow *my* code and not its hard-coded probabilities. For instance, in a
 tree traversal, you really know how you want to build your loop. And these
 days, it seems like the single method of getting it your way is doing asm,
 which obviously is not portable :-(

Probably all true.


 Maybe one thing we would need would be the ability to assign probabilities
 to each branch based on what we expect, so that gcc could build a better
 tree keeping most frequently used code tight.

I don't know if that would *directly* lead to gcc being smarter. I
think perhaps they probably don't benchmark on code bases that have
much explicit annotation (I'm sure they wouldn't seriously benchmark
any parts of Linux as part of daily development). I think the key is
to continue to use annotations _properly_, and eventually gcc should
go in the right direction if enough code uses it.

And if you have really good examples like it sounds like above, then
I guess that should be reported to gcc?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: LatencyTOP: sync_page description

2008-02-17 Thread Nick Piggin

On Saturday 16 February 2008 08:56, Török Edwin wrote:
> Hi Arjan,
>
> LatencyTOP says that sync_page is 'Writing a page to disk', however
> I see that even when no writes are involved, such as during a
> readdir, lseek, etc.
> Naming it a write is misleading, as no program is running that is
> doing constant writes to the disk. The only program is writing to a
> temp dir in /dev/shm.
>
> What would be a better description for sync_page?

Waiting on a page state change (usually: waiting for IO, but can be
also waiting for the page lock which is taken by some other part of
the kernel eg in page reclaim, truncate, buffered writes, page
faults).

> Here are some /proc/latency_stats containing sync_page:
>
> 125 6937678 210821 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> get_unused_fd_flags path_walk do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area
> 306 5677749 215746 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read vfs_llseek
> 21 435657 59966 sync_page sync_page __lock_page wake_bit_function
> read_cache_page_async ntfs_readpage read_cache_page map_mft_record
> ntfs_read_locked_inode ntfs_alloc_big_inode iget5_locked
> ntfs_test_inode
> 195 2716409 133660 blk_unplug sync_page sync_page __lock_page
> wake_bit_function read_cache_page_async ntfs_readpage
> read_cache_page map_mft_record ntfs_read_locked_inode
> ntfs_alloc_big_inode iget5_locked
> 28 1881278 181986 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read get_unused_fd_flags path_walk do_sync_read
> autoremove_wake_function security_file_permission
> 2 17132 9746 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 70 70 irq_exit sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read
> 23 306682 114514 blk_unplug sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 153 153 hrtimer_interrupt smp_apic_timer_interrupt sync_page
> sync_page_killable sync_page_killable __lock_page_killable
> wake_bit_function generic_file_aio_read do_sync_read
> autoremove_wake_function cfq_idle_slice_timer security_file_permission

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-17 Thread Nick Piggin

On Saturday 16 February 2008 14:37, Andrew Morton wrote:
> On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <[EMAIL PROTECTED]> 
wrote:
> > Two callbacks to remove individual pages as done in rmap code
> >
> > invalidate_page()
> >
> > Called from the inner loop of rmap walks to invalidate pages.
> >
> > age_page()
> >
> > Called for the determination of the page referenced status.
> >
> > If we do not care about page referenced status then an age_page callback
> > may be be omitted. PageLock and pte lock are held when either of the
> > functions is called.
>
> The age_page mystery shallows.

BTW. can this callback be called mmu_notifier_clear_flush_young? To
match the core VM.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-17 Thread Nick Piggin

On Saturday 16 February 2008 14:37, Andrew Morton wrote:
 On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter [EMAIL PROTECTED] 
wrote:
  Two callbacks to remove individual pages as done in rmap code
 
  invalidate_page()
 
  Called from the inner loop of rmap walks to invalidate pages.
 
  age_page()
 
  Called for the determination of the page referenced status.
 
  If we do not care about page referenced status then an age_page callback
  may be be omitted. PageLock and pte lock are held when either of the
  functions is called.

 The age_page mystery shallows.

BTW. can this callback be called mmu_notifier_clear_flush_young? To
match the core VM.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: LatencyTOP: sync_page description

2008-02-17 Thread Nick Piggin

On Saturday 16 February 2008 08:56, Török Edwin wrote:
 Hi Arjan,

 LatencyTOP says that sync_page is 'Writing a page to disk', however
 I see that even when no writes are involved, such as during a
 readdir, lseek, etc.
 Naming it a write is misleading, as no program is running that is
 doing constant writes to the disk. The only program is writing to a
 temp dir in /dev/shm.

 What would be a better description for sync_page?

Waiting on a page state change (usually: waiting for IO, but can be
also waiting for the page lock which is taken by some other part of
the kernel eg in page reclaim, truncate, buffered writes, page
faults).

 Here are some /proc/latency_stats containing sync_page:

 125 6937678 210821 sync_page sync_page_killable sync_page_killable
 __lock_page_killable wake_bit_function generic_file_aio_read
 get_unused_fd_flags path_walk do_sync_read autoremove_wake_function
 security_file_permission rw_verify_area
 306 5677749 215746 sync_page sync_page_killable sync_page_killable
 __lock_page_killable wake_bit_function generic_file_aio_read
 do_sync_read autoremove_wake_function security_file_permission
 rw_verify_area vfs_read vfs_llseek
 21 435657 59966 sync_page sync_page __lock_page wake_bit_function
 read_cache_page_async ntfs_readpage read_cache_page map_mft_record
 ntfs_read_locked_inode ntfs_alloc_big_inode iget5_locked
 ntfs_test_inode
 195 2716409 133660 blk_unplug sync_page sync_page __lock_page
 wake_bit_function read_cache_page_async ntfs_readpage
 read_cache_page map_mft_record ntfs_read_locked_inode
 ntfs_alloc_big_inode iget5_locked
 28 1881278 181986 add_to_page_cache_lru sync_page sync_page_killable
 sync_page_killable __lock_page_killable wake_bit_function
 generic_file_aio_read get_unused_fd_flags path_walk do_sync_read
 autoremove_wake_function security_file_permission
 2 17132 9746 add_to_page_cache_lru sync_page sync_page_killable
 sync_page_killable __lock_page_killable wake_bit_function
 generic_file_aio_read do_sync_read autoremove_wake_function
 security_file_permission rw_verify_area vfs_read
 1 70 70 irq_exit sync_page sync_page_killable sync_page_killable
 __lock_page_killable wake_bit_function generic_file_aio_read
 do_sync_read autoremove_wake_function security_file_permission
 rw_verify_area vfs_read
 23 306682 114514 blk_unplug sync_page sync_page_killable
 sync_page_killable __lock_page_killable wake_bit_function
 generic_file_aio_read do_sync_read autoremove_wake_function
 security_file_permission rw_verify_area vfs_read
 1 153 153 hrtimer_interrupt smp_apic_timer_interrupt sync_page
 sync_page_killable sync_page_killable __lock_page_killable
 wake_bit_function generic_file_aio_read do_sync_read
 autoremove_wake_function cfq_idle_slice_timer security_file_permission

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin

On Wednesday 13 February 2008 20:32, Andrew Morton wrote:
> On Wed, 13 Feb 2008 20:24:03 +1100 Nick Piggin <[EMAIL PROTECTED]> 
wrote:
> > BTW is it really true that the buffer can never be locked by
> > anything else at this point?
>
> It has been for the past five or six years.  With the page locked, nobody
> else can get at that page.

Hmm OK.


> > What about fsync_buffers_list?
>
> They're metadata buffers, not regular file data.  Things might get ugly if
> IO to /dev/sda went via that path, but it doesn't.

Yeah right... so the BUG_ON is basically because you want to avoid
the overhead of locking the buffer (which would presumably allow it
to work in situations where someone else might lock the buffer without
locking the page?). OK, makes sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin

On Wednesday 13 February 2008 20:01, Andrew Morton wrote:
> On Wed, 13 Feb 2008 08:26:27 +0100 Bart Dopheide <[EMAIL PROTECTED]> wrote:
> > On Wed, Feb 13, 2008 at 12:05:45PM +1100, Nick Piggin wrote:
> > :)On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > :)> Almost certainly a hardware fail of some sort.
> > :)
> > :)Right, but the kernel shouldn't go bug...
> >
> > Indeed, that's why I'm reporting.
> >
> > :)I don't have a copy of your exact source code... which condition in
> > :)__mpage_writepage went BUG?
> >
> > BUG_ON(buffer_locked(bh));
> >
> > In a bit of context:
> > 482:if (page_has_buffers(page)) {
> > 483:struct buffer_head *head = page_buffers(page);
> > 484:struct buffer_head *bh = head;
> > 485:
> > 486:/* If they're all mapped and dirty, do it */
> > 487:page_block = 0;
> > 488:do {
> > 489:BUG_ON(buffer_locked(bh));
> > 490:if (!buffer_mapped(bh)) {
> > 491:/*
> > 492: * unmapped dirty buffers are created by
> > 493: * __set_page_dirty_buffers -> mmapped
> > data 494: */
> > 495:if (buffer_dirty(bh))
> > 496:goto confused;
> > 497:if (first_unmapped == blocks_per_page)
> > 498:first_unmapped = page_block;
> > 499:continue;
> > 500:}
>
> Probably means that either fat, IDE, block or fs/buffer.c failed to unlock
> a buffer_head when the IO error happened.  It's unlikely to be fat.

Yes that looks like it would be the problem. I can't really
see anything in buffer.c that would do it... 

BTW is it really true that the buffer can never be locked by
anything else at this point? What about fsync_buffers_list?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin

On Wednesday 13 February 2008 20:01, Andrew Morton wrote:
 On Wed, 13 Feb 2008 08:26:27 +0100 Bart Dopheide [EMAIL PROTECTED] wrote:
  On Wed, Feb 13, 2008 at 12:05:45PM +1100, Nick Piggin wrote:
  :)On Wednesday 13 February 2008 08:50, Alan Cox wrote:
  :) Almost certainly a hardware fail of some sort.
  :)
  :)Right, but the kernel shouldn't go bug...
 
  Indeed, that's why I'm reporting.
 
  :)I don't have a copy of your exact source code... which condition in
  :)__mpage_writepage went BUG?
 
  BUG_ON(buffer_locked(bh));
 
  In a bit of context:
  482:if (page_has_buffers(page)) {
  483:struct buffer_head *head = page_buffers(page);
  484:struct buffer_head *bh = head;
  485:
  486:/* If they're all mapped and dirty, do it */
  487:page_block = 0;
  488:do {
  489:BUG_ON(buffer_locked(bh));
  490:if (!buffer_mapped(bh)) {
  491:/*
  492: * unmapped dirty buffers are created by
  493: * __set_page_dirty_buffers - mmapped
  data 494: */
  495:if (buffer_dirty(bh))
  496:goto confused;
  497:if (first_unmapped == blocks_per_page)
  498:first_unmapped = page_block;
  499:continue;
  500:}

 Probably means that either fat, IDE, block or fs/buffer.c failed to unlock
 a buffer_head when the IO error happened.  It's unlikely to be fat.

Yes that looks like it would be the problem. I can't really
see anything in buffer.c that would do it... 

BTW is it really true that the buffer can never be locked by
anything else at this point? What about fsync_buffers_list?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin

On Wednesday 13 February 2008 20:32, Andrew Morton wrote:
 On Wed, 13 Feb 2008 20:24:03 +1100 Nick Piggin [EMAIL PROTECTED] 
wrote:
  BTW is it really true that the buffer can never be locked by
  anything else at this point?

 It has been for the past five or six years.  With the page locked, nobody
 else can get at that page.

Hmm OK.


  What about fsync_buffers_list?

 They're metadata buffers, not regular file data.  Things might get ugly if
 IO to /dev/sda went via that path, but it doesn't.

Yeah right... so the BUG_ON is basically because you want to avoid
the overhead of locking the buffer (which would presumably allow it
to work in situations where someone else might lock the buffer without
locking the page?). OK, makes sense.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 17:06, Max Krasnyansky wrote:
> Nick Piggin wrote:

> > But don't let me dissuade you from making these good improvements
> > to Linux as well :) Just that it isn't really going to be hard-rt
> > in general.
>
> Actually that's the cool thing about CPU isolation. Get rid of all latency
> sources from the CPU(s) and you get youself as hard-RT as it gets.

Hmm, maybe. Removing all sources of latency from the CPU kind of
implies that you have to audit the whole kernel for source of
latency.

> I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec
> worst case and ~200nsec average latency. I do not even need Adeos/Xenomai
> or Preemp-RT just a few very small patches. And it can be used for non RT
> stuff too.

OK, but you then are very restricted in what you can do, and easily
can break it especially if you run any userspace on that CPU. If
you just run a kernel module that, after setup, doesn't use any
other kernel resources except interrupt handling, then you might be
OK (depending on whether even interrupt handling can run into
contended locks)...

If you started doing very much more, then you can easily run into
trouble.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ALPHA] ES40 fails to boot with >=kernel 2.6.23

2008-02-12 Thread Nick Piggin

On Tuesday 12 February 2008 04:27, Raúl Porcel wrote:
> Hi,
>
> We have a Compaq AlphaServer ES40 and since 2.6.23 it won't boot. I'm
> attaching the console log and the kernel config.
>
> Need to say that with a DEC Xp1000 it works fine, although they're
> different machines, of course.
> With .22 it boots fine, and by booting fine i mean after we reverted to
> 2.6.22 it booted again and everything worked as expected.
> Still hangs with latest kernel.
>
> I'm attaching the verlinux output as well, hope it helps. If i'm missing
> something, please don't hesitate to ask.
>
> Thanks

Hi,

Thanks for reporting. I'm not an alpha person, but I have
cc'ed them in case they missed this.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 11:17, Nick Piggin wrote:
> On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:

> > It's a trivial dumb module which does nothing but loads and unloads.
> > I redid ftest03 later without any suspicious activity and it oopsed the
> > same way.
>
> Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
> that code... is there an easy way to get the ftest03 source and run
> it?

OK I didn't realise it is a test from ltp.

But I can't reproduce it for the life of me with the latest git kernel
and latest ltp tarball.

Is it easy to reproduce? Are you reproducing it simply by running the
ftest03 binary directly from the shell? How many times between oopses?
It is multi-process but no threads, so races should be minimal down
this path -- can you get an strace of the failing process?

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
> David Miller wrote:
> > From: Nick Piggin <[EMAIL PROTECTED]>
> > Date: Tue, 12 Feb 2008 17:41:21 +1100
> >
> >> stop machine is used for more than just module loading and unloading.
> >> I don't think you can just disable it.
> >
> > Right, in particular it is used for CPU hotplug.
>
> Ooops. Totally missed that. And a bunch of other places.
>
> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
> Documentation/cpu-hotplug.txt
> arch/s390/kernel/kprobes.c
> drivers/char/hw_random/intel-rng.c
> include/linux/stop_machine.h
> kernel/cpu.c
> kernel/module.c
> kernel/stop_machine.c
> mm/page_alloc.c
>
> I wonder why I did not see any issues when I disabled stop machine
> completely. I mentioned in the other thread that I commented out the part
> that actually halts the machine and ran it for several hours on my dual
> core laptop and on the quad core server. Tried all kinds of workloads,
> which include constant module removal and insertion, and cpu hotplug as
> well. It cannot be just luck :).

It really is. With subtle races, it can take a lot more than a few
hours. Consider that we have subtle races still in the kernel now,
which are almost never or rarely hit in maybe 10,000 hours * every
single person who has been using the current kernel for the past
year.

For a less theoretical example -- when I was writing the RCU radix
tree code, I tried to run directed stress tests on a 64 CPU Altix
machine (which found no bugs). Then I ran it on a dedicated test
harness that could actually do a lot more than the existing kernel
users are able to, and promptly found a couple more bugs (on a 2
CPU system).

But your primary defence against concurrency bugs _has_ to be
knowing the code and all its interactions.

> Clearly though, you guys are right. It cannot be simply disabled. Based on
> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
> and intel rng driver. Hopefully we can avoid it at least in module
> insertion/removal.

Yes, reducing the number of users by going through their code and
showing that it is safe, is the right way to do this. Also, you
could avoid module insertion/removal?

FWIW, I think the idea of trying to turn Linux into giving hard
realtime guarantees is just insane. If that is what you want, you
would IMO be much better off to spend effort with something like
improving adeos and communicatoin/administration between Linux and
the hard-rt kernel.

But don't let me dissuade you from making these good improvements
to Linux as well :) Just that it isn't really going to be hard-rt
in general.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2 resend] mm: various cleanups in get_user_pages()

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 00:10, Eugene Teo wrote:
> Sorry for the repeated emails. Kindly ignore the previous resend. Please
> review this instead. Thanks. I have tested this.

If it is causing this much problems, can you split the cleanups into
their own patches.


> [PATCH 2/2] mm: various cleanups in get_user_pages()
>
> This patch contains various cleanups, including making sure vma is valid,
> and the return value of follow_hugetlb_page() is validated.
>
> Signed-off-by: Eugene Teo <[EMAIL PROTECTED]>
> ---
>  mm/memory.c |   24 
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 54f951b..c7e0610 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1003,7 +1003,9 @@ int get_user_pages(struct task_struct *tsk, struct
> mm_struct *mm, unsigned int foll_flags;
>
>   vma = find_extend_vma(mm, start);
> - if (!vma && in_gate_area(tsk, start)) {
> + if (!vma)
> + goto finish_or_fault;
> + if (in_gate_area(tsk, start)) {
>   unsigned long pg = start & PAGE_MASK;
>   struct vm_area_struct *gate_vma = get_gate_vma(tsk);
>   pgd_t *pgd;

Doesn't this break the logic?

If you don't have a vma, but you are in the gate area, then you
should use the gate vma. With your patch, gate area will fault.

> @@ -1011,7 +1013,7 @@ int get_user_pages(struct task_struct *tsk, struct
> mm_struct *mm, pmd_t *pmd;
>   pte_t *pte;
>   if (write) /* user gate pages are read-only */
> - return i ? : -EFAULT;
> + goto finish_or_fault;

I don't know if this is exactly a cleanup or not... I guess gcc
probably isn't smart enough to fold them all together, so it should
use a little less code in the unlikely branches. Does it?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel BUG at fs/mpage.c:489

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > Feb 12 19:55:08 butterfly kernel: hde: dma timeout error: status=0xd0 {
> > Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode was: unknown
>
> Your drive stopped responding.
>
> > Feb 12 19:55:08 butterfly kernel: hde: DMA disabled
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Primary channel reset.
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Secondary channel reset.
>
> We gave it a good kicking and it stayed offline
>
> > Feb 12 19:55:08 butterfly kernel: hde: set_drive_speed_status:
> > status=0xd0 { Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode
> > was: unknown Feb 12 19:55:47 butterfly kernel: ide2: reset timed-out,
> > status=0xd0 Feb 12 19:55:47 butterfly kernel: hde: status timeout:
> > status=0xd0 { Busy }
>
> And we gave up.
>
> Almost certainly a hardware fail of some sort.

Right, but the kernel shouldn't go bug...

I don't have a copy of your exact source code... which condition in
__mpage_writepage went BUG?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:
> > On Sun, 10 Feb 2008 17:00:31 +0300
> >
> > Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> > > This happened during LTP. FWIW, modprobe/rmmod trivial empty module
> > > together with cat /proc/*/wchan and cat /proc/modules were also
> > > running.
> > >
> > > Box is E6400, much debugging is on, config below.
> > >
> > >
> > > [ 4057.31] BUG: unable to handle kernel paging request at
> > > 810101dbc008 [ 4057.31] IP: []
> > > iov_iter_advance+0x38/0x70 [ 4057.31] PGD 8063 PUD c063 PMD
> > > 153baa163 PTE 800101dbc160 [ 4057.31] Oops:  [1] SMP
> > > DEBUG_PAGEALLOC
> > > [ 4057.31] CPU 0
> > > [ 4057.31] Modules linked in: [last unloaded: foo]
> >
> > what is this foo.ko of which you speak, and did it wreck your kernel?
>
> It's a trivial dumb module which does nothing but loads and unloads.
> I redid ftest03 later without any suspicious activity and it oopsed the
> same way.

Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
that code... is there an easy way to get the ftest03 source and run
it?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-sha1: RIP [ffffffff802596c8] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
 On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:
  On Sun, 10 Feb 2008 17:00:31 +0300
 
  Alexey Dobriyan [EMAIL PROTECTED] wrote:
   This happened during LTP. FWIW, modprobe/rmmod trivial empty module
   together with cat /proc/*/wchan and cat /proc/modules were also
   running.
  
   Box is E6400, much debugging is on, config below.
  
  
   [ 4057.31] BUG: unable to handle kernel paging request at
   810101dbc008 [ 4057.31] IP: [802596c8]
   iov_iter_advance+0x38/0x70 [ 4057.31] PGD 8063 PUD c063 PMD
   153baa163 PTE 800101dbc160 [ 4057.31] Oops:  [1] SMP
   DEBUG_PAGEALLOC
   [ 4057.31] CPU 0
   [ 4057.31] Modules linked in: [last unloaded: foo]
 
  what is this foo.ko of which you speak, and did it wreck your kernel?

 It's a trivial dumb module which does nothing but loads and unloads.
 I redid ftest03 later without any suspicious activity and it oopsed the
 same way.

Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
that code... is there an easy way to get the ftest03 source and run
it?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Kernel BUG at fs/mpage.c:489

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 08:50, Alan Cox wrote:
  Feb 12 19:55:08 butterfly kernel: hde: dma timeout error: status=0xd0 {
  Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode was: unknown

 Your drive stopped responding.

  Feb 12 19:55:08 butterfly kernel: hde: DMA disabled
  Feb 12 19:55:08 butterfly kernel: PDC202XX: Primary channel reset.
  Feb 12 19:55:08 butterfly kernel: PDC202XX: Secondary channel reset.

 We gave it a good kicking and it stayed offline

  Feb 12 19:55:08 butterfly kernel: hde: set_drive_speed_status:
  status=0xd0 { Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode
  was: unknown Feb 12 19:55:47 butterfly kernel: ide2: reset timed-out,
  status=0xd0 Feb 12 19:55:47 butterfly kernel: hde: status timeout:
  status=0xd0 { Busy }

 And we gave up.

 Almost certainly a hardware fail of some sort.

Right, but the kernel shouldn't go bug...

I don't have a copy of your exact source code... which condition in
__mpage_writepage went BUG?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-sha1: RIP [ffffffff802596c8] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 11:17, Nick Piggin wrote:
 On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:

  It's a trivial dumb module which does nothing but loads and unloads.
  I redid ftest03 later without any suspicious activity and it oopsed the
  same way.

 Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
 that code... is there an easy way to get the ftest03 source and run
 it?

OK I didn't realise it is a test from ltp.

But I can't reproduce it for the life of me with the latest git kernel
and latest ltp tarball.

Is it easy to reproduce? Are you reproducing it simply by running the
ftest03 binary directly from the shell? How many times between oopses?
It is multi-process but no threads, so races should be minimal down
this path -- can you get an strace of the failing process?

Thanks,
Nick
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
 David Miller wrote:
  From: Nick Piggin [EMAIL PROTECTED]
  Date: Tue, 12 Feb 2008 17:41:21 +1100

  stop machine is used for more than just module loading and unloading.
  I don't think you can just disable it.

  Right, in particular it is used for CPU hotplug.

 Ooops. Totally missed that. And a bunch of other places.

 [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
 Documentation/cpu-hotplug.txt
 arch/s390/kernel/kprobes.c
 drivers/char/hw_random/intel-rng.c
 include/linux/stop_machine.h
 kernel/cpu.c
 kernel/module.c
 kernel/stop_machine.c
 mm/page_alloc.c

 I wonder why I did not see any issues when I disabled stop machine
 completely. I mentioned in the other thread that I commented out the part
 that actually halts the machine and ran it for several hours on my dual
 core laptop and on the quad core server. Tried all kinds of workloads,
 which include constant module removal and insertion, and cpu hotplug as
 well. It cannot be just luck :).

It really is. With subtle races, it can take a lot more than a few
hours. Consider that we have subtle races still in the kernel now,
which are almost never or rarely hit in maybe 10,000 hours * every
single person who has been using the current kernel for the past
year.

For a less theoretical example -- when I was writing the RCU radix
tree code, I tried to run directed stress tests on a 64 CPU Altix
machine (which found no bugs). Then I ran it on a dedicated test
harness that could actually do a lot more than the existing kernel
users are able to, and promptly found a couple more bugs (on a 2
CPU system).

But your primary defence against concurrency bugs _has_ to be
knowing the code and all its interactions.

 Clearly though, you guys are right. It cannot be simply disabled. Based on
 the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
 and intel rng driver. Hopefully we can avoid it at least in module
 insertion/removal.

Yes, reducing the number of users by going through their code and
showing that it is safe, is the right way to do this. Also, you
could avoid module insertion/removal?

FWIW, I think the idea of trying to turn Linux into giving hard
realtime guarantees is just insane. If that is what you want, you
would IMO be much better off to spend effort with something like
improving adeos and communicatoin/administration between Linux and
the hard-rt kernel.

But don't let me dissuade you from making these good improvements
to Linux as well :) Just that it isn't really going to be hard-rt
in general.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ALPHA] ES40 fails to boot with =kernel 2.6.23

2008-02-12 Thread Nick Piggin

On Tuesday 12 February 2008 04:27, Raúl Porcel wrote:
 Hi,

 We have a Compaq AlphaServer ES40 and since 2.6.23 it won't boot. I'm
 attaching the console log and the kernel config.

 Need to say that with a DEC Xp1000 it works fine, although they're
 different machines, of course.
 With .22 it boots fine, and by booting fine i mean after we reverted to
 2.6.22 it booted again and everything worked as expected.
 Still hangs with latest kernel.

 I'm attaching the verlinux output as well, hope it helps. If i'm missing
 something, please don't hesitate to ask.

 Thanks

Hi,

Thanks for reporting. I'm not an alpha person, but I have
cc'ed them in case they missed this.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2 resend] mm: various cleanups in get_user_pages()

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 00:10, Eugene Teo wrote:
 Sorry for the repeated emails. Kindly ignore the previous resend. Please
 review this instead. Thanks. I have tested this.

If it is causing this much problems, can you split the cleanups into
their own patches.


 [PATCH 2/2] mm: various cleanups in get_user_pages()

 This patch contains various cleanups, including making sure vma is valid,
 and the return value of follow_hugetlb_page() is validated.

 Signed-off-by: Eugene Teo [EMAIL PROTECTED]
 ---
  mm/memory.c |   24 
  1 file changed, 16 insertions(+), 8 deletions(-)

 diff --git a/mm/memory.c b/mm/memory.c
 index 54f951b..c7e0610 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -1003,7 +1003,9 @@ int get_user_pages(struct task_struct *tsk, struct
 mm_struct *mm, unsigned int foll_flags;

   vma = find_extend_vma(mm, start);
 - if (!vma  in_gate_area(tsk, start)) {
 + if (!vma)
 + goto finish_or_fault;
 + if (in_gate_area(tsk, start)) {
   unsigned long pg = start  PAGE_MASK;
   struct vm_area_struct *gate_vma = get_gate_vma(tsk);
   pgd_t *pgd;

Doesn't this break the logic?

If you don't have a vma, but you are in the gate area, then you
should use the gate vma. With your patch, gate area will fault.

 @@ -1011,7 +1013,7 @@ int get_user_pages(struct task_struct *tsk, struct
 mm_struct *mm, pmd_t *pmd;
   pte_t *pte;
   if (write) /* user gate pages are read-only */
 - return i ? : -EFAULT;
 + goto finish_or_fault;

I don't know if this is exactly a cleanup or not... I guess gcc
probably isn't smart enough to fold them all together, so it should
use a little less code in the unlikely branches. Does it?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin

On Wednesday 13 February 2008 17:06, Max Krasnyansky wrote:
 Nick Piggin wrote:

  But don't let me dissuade you from making these good improvements
  to Linux as well :) Just that it isn't really going to be hard-rt
  in general.

 Actually that's the cool thing about CPU isolation. Get rid of all latency
 sources from the CPU(s) and you get youself as hard-RT as it gets.

Hmm, maybe. Removing all sources of latency from the CPU kind of
implies that you have to audit the whole kernel for source of
latency.

 I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec
 worst case and ~200nsec average latency. I do not even need Adeos/Xenomai
 or Preemp-RT just a few very small patches. And it can be used for non RT
 stuff too.

OK, but you then are very restricted in what you can do, and easily
can break it especially if you run any userspace on that CPU. If
you just run a kernel module that, after setup, doesn't use any
other kernel resources except interrupt handling, then you might be
OK (depending on whether even interrupt handling can run into
contended locks)...

If you started doing very much more, then you can easily run into
trouble.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-11 Thread Nick Piggin

On Tuesday 12 February 2008 15:10, Max Krasnyansky wrote:

> Rusty - Stop machine.
>After doing a bunch of testing last three days I actually downgraded
> stop machine changes from [highly experimental] to simply [experimental].
> Pleas see this thread for more info:
> http://marc.info/?l=linux-kernel=120243837206248=2 Short story is that
> I ran several insmod/rmmod workloads on live multi-core boxes with stop
> machine _completely_ disabled and did no see any issues. Rusty did not get
> a chance to reply yet, I hopping that we'll be able to make "stop machine"
> completely optional for some configurations.

stop machine is used for more than just module loading and unloading.
I don't think you can just disable it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin

On Tuesday 12 February 2008 14:16, Robert Hancock wrote:
> Nick Piggin wrote:
> > On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> >> Avoid buffer overflows in get_user_pages()
> >>
> >> So I spent a while pounding my head against my monitor trying to figure
> >> out the vmsplice() vulnerability - how could a failure to check for
> >> *read* access turn into a root exploit?  It turns out that it's a buffer
> >> overflow problem which is made easy by the way get_user_pages() is
> >> coded.
> >>
> >> In particular, "len" is a signed int, and it is only checked at the
> >> *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
> >> will execute once and decrement len to -1.  At that point, the loop will
> >> proceed until the next invalid address is found; in the process, it will
> >> likely overflow the pages array passed in to get_user_pages().
> >>
> >> I think that, if get_user_pages() has been asked to grab zero pages,
> >> that's what it should do.  Thus this patch; it is, among other things,
> >> enough to block the (already fixed) root exploit and any others which
> >> might be lurking in similar code.  I also think that the number of pages
> >> should be unsigned, but changing the prototype of this function probably
> >> requires some more careful review.
> >>
> >> Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e5628a5..7f50fd8 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct
> >> mm_struct *mm, int i;
> >>unsigned int vm_flags;
> >>
> >> +  if (len <= 0)
> >> +  return 0;
> >
> > BUG_ON()?
>
> Well, not if the code involved in the exploit can pass a zero value,

Which is a bug, and you want to catch it.


> otherwise it's just turning it into a DoS..

If it is due to a security bug, then the fix is to fix the point
where the kernel starts trusting an untrusted value. Not to hide
the bug like this. Arguably, a BUG_ON is better in the case of a
security hole because you want to halt the process as soon as you
detect a problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin

On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> Avoid buffer overflows in get_user_pages()
>
> So I spent a while pounding my head against my monitor trying to figure
> out the vmsplice() vulnerability - how could a failure to check for
> *read* access turn into a root exploit?  It turns out that it's a buffer
> overflow problem which is made easy by the way get_user_pages() is
> coded.
>
> In particular, "len" is a signed int, and it is only checked at the
> *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
> will execute once and decrement len to -1.  At that point, the loop will
> proceed until the next invalid address is found; in the process, it will
> likely overflow the pages array passed in to get_user_pages().
>
> I think that, if get_user_pages() has been asked to grab zero pages,
> that's what it should do.  Thus this patch; it is, among other things,
> enough to block the (already fixed) root exploit and any others which
> might be lurking in similar code.  I also think that the number of pages
> should be unsigned, but changing the prototype of this function probably
> requires some more careful review.
>
> Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e5628a5..7f50fd8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct
> mm_struct *mm, int i;
>   unsigned int vm_flags;
>
> + if (len <= 0)
> + return 0;

BUG_ON()?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-11 Thread Nick Piggin

On Tuesday 12 February 2008 15:10, Max Krasnyansky wrote:

 Rusty - Stop machine.
After doing a bunch of testing last three days I actually downgraded
 stop machine changes from [highly experimental] to simply [experimental].
 Pleas see this thread for more info:
 http://marc.info/?l=linux-kernelm=120243837206248w=2 Short story is that
 I ran several insmod/rmmod workloads on live multi-core boxes with stop
 machine _completely_ disabled and did no see any issues. Rusty did not get
 a chance to reply yet, I hopping that we'll be able to make stop machine
 completely optional for some configurations.

stop machine is used for more than just module loading and unloading.
I don't think you can just disable it.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin

On Tuesday 12 February 2008 14:16, Robert Hancock wrote:
 Nick Piggin wrote:
  On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
  Avoid buffer overflows in get_user_pages()
 
  So I spent a while pounding my head against my monitor trying to figure
  out the vmsplice() vulnerability - how could a failure to check for
  *read* access turn into a root exploit?  It turns out that it's a buffer
  overflow problem which is made easy by the way get_user_pages() is
  coded.
 
  In particular, len is a signed int, and it is only checked at the
  *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
  will execute once and decrement len to -1.  At that point, the loop will
  proceed until the next invalid address is found; in the process, it will
  likely overflow the pages array passed in to get_user_pages().
 
  I think that, if get_user_pages() has been asked to grab zero pages,
  that's what it should do.  Thus this patch; it is, among other things,
  enough to block the (already fixed) root exploit and any others which
  might be lurking in similar code.  I also think that the number of pages
  should be unsigned, but changing the prototype of this function probably
  requires some more careful review.
 
  Signed-off-by: Jonathan Corbet [EMAIL PROTECTED]
 
  diff --git a/mm/memory.c b/mm/memory.c
  index e5628a5..7f50fd8 100644
  --- a/mm/memory.c
  +++ b/mm/memory.c
  @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct
  mm_struct *mm, int i;
 unsigned int vm_flags;
 
  +  if (len = 0)
  +  return 0;
 
  BUG_ON()?

 Well, not if the code involved in the exploit can pass a zero value,

Which is a bug, and you want to catch it.


 otherwise it's just turning it into a DoS..

If it is due to a security bug, then the fix is to fix the point
where the kernel starts trusting an untrusted value. Not to hide
the bug like this. Arguably, a BUG_ON is better in the case of a
security hole because you want to halt the process as soon as you
detect a problem.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin

On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
 Avoid buffer overflows in get_user_pages()

 So I spent a while pounding my head against my monitor trying to figure
 out the vmsplice() vulnerability - how could a failure to check for
 *read* access turn into a root exploit?  It turns out that it's a buffer
 overflow problem which is made easy by the way get_user_pages() is
 coded.

 In particular, len is a signed int, and it is only checked at the
 *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
 will execute once and decrement len to -1.  At that point, the loop will
 proceed until the next invalid address is found; in the process, it will
 likely overflow the pages array passed in to get_user_pages().

 I think that, if get_user_pages() has been asked to grab zero pages,
 that's what it should do.  Thus this patch; it is, among other things,
 enough to block the (already fixed) root exploit and any others which
 might be lurking in similar code.  I also think that the number of pages
 should be unsigned, but changing the prototype of this function probably
 requires some more careful review.

 Signed-off-by: Jonathan Corbet [EMAIL PROTECTED]

 diff --git a/mm/memory.c b/mm/memory.c
 index e5628a5..7f50fd8 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct
 mm_struct *mm, int i;
   unsigned int vm_flags;

 + if (len = 0)
 + return 0;

BUG_ON()?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops report for the week upto Feb 10th 2008

2008-02-10 Thread Nick Piggin

On Monday 11 February 2008 11:35, Arjan van de Ven wrote:
> The http://www.kerneloops.org website collects kernel oops and
> warning reports from various mailing lists and bugzillas as well as
> with a client users can install to auto-submit oopses.
> Below is a top 10 list of the oopses/backtraces collected in the last 7
> days. (Reports prior to 2.6.23 have been omitted in collecting the top 10)
>
> This week, a total of 323 oopses and warnings have been reported,
> compared to 110 reports in the previous week.
>
> (This sharp increase is due to Fedora 9 alpha shipping the oops data
> collection client in the default install, giving us much wider coverage
> in the issues that actual users hit; many thanks to the Fedora project
> for this)
>
> With the 2.6.25-rc1 release out, this will be the last report that includes
> 2.6.23; future reports will only include issues from 2.6.24 and later.
>
>
> Rank 1: set_dentry_child_flags
>   WARN_ON at fs/inotify.c:172 set_dentry_child_flags
>   Reported 93 times (116 total reports)
>   This is a user triggered WARN_ON in inotify. Sadly inotify seems to be
> unmaintained. More info:
> http://www.kerneloops.org/search.php?search=set_dentry_child_flags

I was never able to trigger this or get anyone to reliably trigger it with
a debug patch in. Which is why it has taken so long to fix. It looks like
kde4 is triggering this big rash of new reports.

Anyway, I have fixed a race or two and removed that warning code (which was
also a little racy). So I think that should be OK.


> Rank 9: mark_buffer_dirty
>   WARN_ON at fs/buffer.c:1169
>   This indicates that a non-uptodate buffer is marked dirty.
>   This can lead to data corruption!
>   Reported 5 times (12 total reports) - Only seen since 2.6.24-rc6
>   Usually happens during umount()
>   More info: http://www.kerneloops.org/search.php?search=mark_buffer_dirty

That's interesting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops report for the week upto Feb 10th 2008

2008-02-10 Thread Nick Piggin

On Monday 11 February 2008 11:35, Arjan van de Ven wrote:
 The http://www.kerneloops.org website collects kernel oops and
 warning reports from various mailing lists and bugzillas as well as
 with a client users can install to auto-submit oopses.
 Below is a top 10 list of the oopses/backtraces collected in the last 7
 days. (Reports prior to 2.6.23 have been omitted in collecting the top 10)

 This week, a total of 323 oopses and warnings have been reported,
 compared to 110 reports in the previous week.

 (This sharp increase is due to Fedora 9 alpha shipping the oops data
 collection client in the default install, giving us much wider coverage
 in the issues that actual users hit; many thanks to the Fedora project
 for this)

 With the 2.6.25-rc1 release out, this will be the last report that includes
 2.6.23; future reports will only include issues from 2.6.24 and later.


 Rank 1: set_dentry_child_flags
   WARN_ON at fs/inotify.c:172 set_dentry_child_flags
   Reported 93 times (116 total reports)
   This is a user triggered WARN_ON in inotify. Sadly inotify seems to be
 unmaintained. More info:
 http://www.kerneloops.org/search.php?search=set_dentry_child_flags

I was never able to trigger this or get anyone to reliably trigger it with
a debug patch in. Which is why it has taken so long to fix. It looks like
kde4 is triggering this big rash of new reports.

Anyway, I have fixed a race or two and removed that warning code (which was
also a little racy). So I think that should be OK.


 Rank 9: mark_buffer_dirty
   WARN_ON at fs/buffer.c:1169
   This indicates that a non-uptodate buffer is marked dirty.
   This can lead to data corruption!
   Reported 5 times (12 total reports) - Only seen since 2.6.24-rc6
   Usually happens during umount()
   More info: http://www.kerneloops.org/search.php?search=mark_buffer_dirty

That's interesting.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] block layer: kmemcheck fixes

2008-02-08 Thread Nick Piggin

On Fri, Feb 08, 2008 at 02:56:09PM -0800, Arjan van de Ven wrote:
> Nick Piggin wrote:
> >>>Maybe cpus these days have so much store bandwith that doing
> >>>things like the above is OK, but I doubt it :-)
> >>on modern x86 cpus the memset may even be faster if the memory isn't in 
> >>cache;
> >>the "explicit" method ends up doing Write Allocate on the cache lines
> >>(so read them from memory) even though they then end up being written 
> >>entirely.
> >>With memset the CPU is told that the entire range is set to a new value, 
> >>and
> >>the WA can be avoided for the whole-cachelines in the range.
> >
> >Don't you have write combining store buffers? Or is it still speculatively
> >issuing the reads even before the whole cacheline is combined?
> 
> x86 memory order model doesn't allow that quite; and you need a "series" of 
> at least 64 bytes
> without any other memory accesses in between even if it would
> not happening in practice.

OK, fair enough... then it will be a very nice test to see if it
helps. I'm sure you could have an arch specific initialisation
function if it makes a significant difference.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] block layer: kmemcheck fixes

2008-02-08 Thread Nick Piggin

On Fri, Feb 08, 2008 at 07:09:07AM -0800, Arjan van de Ven wrote:
> David Miller wrote:
> >From: Linus Torvalds <[EMAIL PROTECTED]>
> >Date: Thu, 7 Feb 2008 09:42:56 -0800 (PST)
> >
> >>Can we please just stop doing these one-by-one assignments, and just do 
> >>something like
> >>
> >>memset(rq, 0, sizeof(*rq));
> >>rq->q = q;
> >>rq->ref_count = 1;
> >>INIT_HLIST_NODE(>hash);
> >>RB_CLEAR_NODE(>rb_node);
> >>
> >>instead?
> >>
> >>The memset() is likely faster and smaller than one-by-one assignments 
> >>anyway, even if the one-by-ones can avoid initializing some field or 
> >>there ends up being a double initialization..
> >
> >The problem is store buffer compression.  At least a few years
> >ago this made a huge difference in sk_buff initialization in the
> >networking.
> >
> >Maybe cpus these days have so much store bandwith that doing
> >things like the above is OK, but I doubt it :-)
> 
> on modern x86 cpus the memset may even be faster if the memory isn't in 
> cache;
> the "explicit" method ends up doing Write Allocate on the cache lines
> (so read them from memory) even though they then end up being written 
> entirely.
> With memset the CPU is told that the entire range is set to a new value, and
> the WA can be avoided for the whole-cachelines in the range.

Don't you have write combining store buffers? Or is it still speculatively
issuing the reads even before the whole cacheline is combined?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-08 Thread Nick Piggin

On Fri, Feb 08, 2008 at 09:24:22AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> > > On Fri, Feb 08 2008, Nick Piggin wrote:
> > > > And if you don't?
> > > 
> > > Well if you don't ask for anything, you wont get anything :-)
> > > As I mentioned, the patch is a playing ground for trying various setups.
> > > Everything defaults to 'do as usual', set options to setup certain test
> > > scenarios.
> > 
> > I mean if you don't know the completing CPU.
> 
> I still don't know quite what part of that patch you are referring to
> here. If you don't have queue_affinity set, queueing a new request with
> the hardware is generally done on the same CPU that just completed a
> request. That is true even without any patches.

Generally, but I guess not always. The database workloads in question
(which you might know very well about ;)) apparently has a lot of
queue empty and unplug conditions. Which I guess is the reason for
Intel's initial patch.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-08 Thread Nick Piggin

On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > And if you don't?
> 
> Well if you don't ask for anything, you wont get anything :-)
> As I mentioned, the patch is a playing ground for trying various setups.
> Everything defaults to 'do as usual', set options to setup certain test
> scenarios.

I mean if you don't know the completing CPU.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull] more SLUB updates for 2.6.25

2008-02-08 Thread Nick Piggin

On Friday 08 February 2008 18:29, Eric Dumazet wrote:
> Nick Piggin a écrit :
> > On Friday 08 February 2008 13:13, Christoph Lameter wrote:
> >> are available in the git repository at:
> >>
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git
> >> slub-linus
> >>
> >> (includes the cmpxchg_local fastpath since the cmpxchg_local work
> >> by Matheiu is in now, and the non atomic unlock by Nick. Verified that
> >> this is not doing any harm after some other patches had been removed.
> >
> > Ah, good. I think it is always a good thing to be able to remove atomics.
> > They place quite a bit of burden on the CPU, especially x86 where it also
> > has implicit memory ordering semantics (although x86 can speculatively
> > get around much of the problem, it's obviously worse than no restriction)
> >
> > Even if perhaps some cache coherency or timing quirk makes the non-atomic
> > version slower (all else being equal), then I'd still say that the non
> > atomic version should be preferred.
>
> What about IRQ masking then ?

I really did mean all else being equal. eg. "clear_bit" vs "__clear_bit".


> Many CPU pay high cost for cli/sti pair...

True, and many UP architectures have to implement atomic operations
with cli/sti pairs... so those are more reasons to use non-atomics.


> And SLAB/SLUB allocators, even if only used from process context, want to
> disable/re-enable interrupts...
>
> I understand kmalloc() want generic pools, but dedicated pools could avoid
> this cli/sti

Sure, I guess that would be possible. I've kind of toyed with doing
some cli/sti mitigation in the page allocator, but in that case I
found that it wasn't a win outside microbenchmarks: the cache
characteristics of the returned pages are just as important if not
more so than cli/sti costs (although that balance would change
depending on the CPU and workload I guess).

For slub yes you could do it with fewer downsides with process context
pools.

Is it possible instead for architectures where cli/sti is so expensive
to change their lowest level of irq handling to do this by setting and
clearing a soft flag somewhere? That's what I'd rather see, if possible.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] block layer: kmemcheck fixes

2008-02-08 Thread Nick Piggin

On Fri, Feb 08, 2008 at 02:56:09PM -0800, Arjan van de Ven wrote:
 Nick Piggin wrote:
 Maybe cpus these days have so much store bandwith that doing
 things like the above is OK, but I doubt it :-)
 on modern x86 cpus the memset may even be faster if the memory isn't in 
 cache;
 the explicit method ends up doing Write Allocate on the cache lines
 (so read them from memory) even though they then end up being written 
 entirely.
 With memset the CPU is told that the entire range is set to a new value, 
 and
 the WA can be avoided for the whole-cachelines in the range.
 
 Don't you have write combining store buffers? Or is it still speculatively
 issuing the reads even before the whole cacheline is combined?
 
 x86 memory order model doesn't allow that quite; and you need a series of 
 at least 64 bytes
 without any other memory accesses in between even if it would
 not happening in practice.

OK, fair enough... then it will be a very nice test to see if it
helps. I'm sure you could have an arch specific initialisation
function if it makes a significant difference.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 3926 matches

Mail list logo