Re: oops when using git gc --auto

2008-02-26 Thread Nick Piggin
On Wednesday 27 February 2008 00:22, Otavio Salvador wrote:
> Hello,
>
> Today I got this oops; does someone have an idea of what's going wrong?
>
> Unable to handle kernel paging request at 0200 RIP:
>  [] find_get_pages+0x3c/0x69

At this point, the most likely candidate is a memory corruption
error, probably hardware. Can you run memtest86 for a few hours
to get a bit more confidence in the hw (preferably overnight)?

I did recently see another quite similar corruption in the
pagecache radix-tree, though. Coincidence maybe?

> PGD 0
> Oops:  [1] SMP
> CPU 3
> Modules linked in: sha256_generic aes_generic aes_x86_64 cbc blkcipher
> nvidia(P) rfcomm l2cap bluetooth ac battery ipv6 nfs lockd nfs_acl sunrpc
> bridge ext2 mbcache dm_crypt tun kvm_intel kvm loop snd_usb_audio
> snd_usb_lib snd_rawmidi snd_hda_intel e1000e i2c_i801 serio_raw
> snd_seq_device snd_pcm intel_agp button snd_timer pcspkr psmouse snd_hwdep
> snd snd_page_alloc soundcore evdev i2c_core xfs dm_mirror dm_snapshot
> dm_mod raid0 md_mod sg sr_mod cdrom sd_mod usbhid hid usb_storage
> pata_marvell floppy ahci ata_generic libata scsi_mod ehci_hcd uhci_hcd
> thermal processor fan
> Pid: 15684, comm: git Tainted: P 2.6.24-1-amd64 #1
> RIP: 0010:[]  []
> find_get_pages+0x3c/0x69 RSP: 0018:8100394dfd98  EFLAGS: 00010097
> RAX: 0009 RBX: 000e RCX: 0009
> RDX: 0200 RSI: 000a RDI: 0040
> RBP: 810042964350 R08: 0040 R09: 000a
> R10: 8100425a06c8 R11: 000a R12: 000e
> R13: 8100394dfdf8 R14: 810042964350 R15: 
> FS:  2ae326df2190() GS:81007d7aeb40()
> knlGS: CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 0200 CR3: 358f9000 CR4: 26e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process git (pid: 15684, threadinfo 8100394de000, task
> 8100359cd800) Stack:  000d 8100394dfde8
> 000d 000e 000e 802794d6
> 8100014a7768 80279b04  
>   Call Trace:
>  [] pagevec_lookup+0x17/0x1e
>  [] truncate_inode_pages_range+0x108/0x2bd
>  [] generic_delete_inode+0xbf/0x127
>  [] do_unlinkat+0xd5/0x144
>  [] sys_write+0x45/0x6e
>  [] system_call+0x7e/0x83
>
>
> Code: 48 8b 02 25 00 40 02 00 48 3d 00 40 02 00 75 04 48 8b 52 10
> RIP  [] find_get_pages+0x3c/0x69
>  RSP 
> CR2: 0200
> ---[ end trace cb43a9f4488b815a ]---



Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 18:21, Gleb Natapov wrote:
> On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
> > > You are missing one point here.  The MPI specifications that have
> > > been out there for decades do not require the process use a library
> > > for allocating the buffer.  I realize that is a horrible shortcoming,
> > > but that is the world we live in.  Even if we could change that spec,
> >
> > Can you change the spec?
>
> Not really. It will break all existing codes.

I meant e.g. submitting changes to MPI-3.


> MPI-2 provides a call for 
> memory allocation (and it's beneficial to use this call for some
> interconnects), but many (most?) applications are still written for MPI-1
> and those that are written for MPI-2 mostly use the old habit of
> allocating memory by malloc(), or even use stack or BSS memory for
> communication buffer purposes.

OK, so MPI-2 already has some way to do that... I'm not saying that we
can now completely dismiss the idea of using notifiers for this, but it
is just a good data point to know.
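
For reference, the MPI-2 call in question is MPI_Alloc_mem(). A minimal
usage sketch, just to make the data point concrete (not from this
thread):

#include <mpi.h>

int main(int argc, char **argv)
{
	int *buf;

	MPI_Init(&argc, &argv);
	/* Let the library place the buffer where the interconnect
	 * can register it cheaply, instead of plain malloc(). */
	MPI_Alloc_mem(1024 * sizeof(int), MPI_INFO_NULL, &buf);
	/* ... use buf as a communication buffer ... */
	MPI_Free_mem(buf);
	MPI_Finalize();
	return 0;
}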

Thanks,
Nick



Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)
> >
> > Two passes:
> >
> > Pass 1: shove each of the segments into the queue with
> > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> >
> > Pass 2: wait for them all to complete and return accumulated result
> > with SYNC_FILE_RANGE_WAIT_AFTER
>
> Thanks.
>
> Seems ok, though being able to cork the I/O until the last one would
> be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)
>
> I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
> reason why you have it there?  The man page isn't very enlightening.


Yeah, sync_file_range has slightly unusual semantics and introduces
a new concept, "writeout", to userspace (does "writeout" include
"in drive cache"? The kernel doesn't think so, but the only way to
make sync_file_range "safe" is if you do consider it writeout).

If it makes it any easier to understand, we could add
SYNC_FILE_ASYNC and SYNC_FILE_SYNC parts that just deal with the
safe/unsafe and sync/async semantics that are part of the normal
POSIX API.

Anyway, making fsync/fdatasync etc. safe by default is a good idea
IMO, and it's a bad bug that we don't do that :(
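
For the archives, Andrew's two-pass scheme over a vector of ranges
looks roughly like this from userspace ("struct range" is made up
here; error handling is elided):

#define _GNU_SOURCE
#include <fcntl.h>

struct range { off64_t off, len; };	/* hypothetical */

int sync_ranges(int fd, const struct range *r, int n)
{
	int i, err = 0;

	/* Pass 1: shove each segment into the queue. */
	for (i = 0; i < n; i++)
		err |= sync_file_range(fd, r[i].off, r[i].len,
				SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE);

	/* Pass 2: wait for them all, accumulating the result. */
	for (i = 0; i < n; i++)
		err |= sync_file_range(fd, r[i].off, r[i].len,
				SYNC_FILE_RANGE_WAIT_AFTER);

	return err ? -1 : 0;
}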



Re: [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)

2008-02-25 Thread Nick Piggin
On Thursday 21 February 2008 21:58, Robin Holt wrote:
> On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote:
> > > > So why can't you export a device from your xpmem driver, which
> > > > can be mmap()ed to give out "anonymous" memory pages to be used
> > > > for these communication buffers?
> > >
> > > Because we need to have heap and stack available as well.  MPT does
> > > not control all the communication buffer areas.  I haven't checked, but
> > > this is the same problem that IB will have.  I believe they are
> > > actually allowing any memory region be accessible, but I am not sure of
> > > that.
> >
> > Then you should create a driver that the user program can register
> > and unregister regions of their memory with. The driver can do a
> > get_user_pages to get the pages, and then you'd just need to set up
> > some kind of mapping so that userspace can unmap pages / won't leak
> > memory (and an exit_mm notifier I guess).
>
> OK.  You need to explain this better to me.  How would this driver
> supposedly work?  What we have is an MPI library.  It gets invoked at
> process load time to establish its rank-to-rank communication regions.
> It then turns control over to the process's main().  That is allowed to
> run until it hits the
>   MPI_Init(&argc, &argv);
>
> The process is then totally under the user's control until:
>   MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, 
> MPI_COMM_WORLD);
>   MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD,
> &status);
>
> That is it.  That is all our allowed interaction with the user's process.

OK, when you said something along the lines of "the MPT library has
control of the comm buffer", I assumed it was an area of virtual
memory which is set up as part of initialization, rather than during
runtime. I guess I jumped to conclusions.


> That doesn't seem too unreasonable, except when you compare it to how the
> driver currently works.  Remember, this is done from a library which has
> no insight into what the user has done to its own virtual address space.
> As a result, each MPI_Send() would result in a system call (or we would
> need to have a set of callouts for changes to a process's VMAs) which
> would be a significant increase in communication overhead.
>
> Maybe I am missing what you intend to do, but what we need is a means of
> tracking one process's virtual address space changes so other processes
> can do direct memory accesses without the need for a system call on each
> communication event.

Yeah, it's tricky. BTW, what is the performance difference between
having a system call or not?


> > Because you don't need to swap, you don't need coherency, and you
> > are in control of the areas, then this seems like the best choice.
> > It would allow you to use heap, stack, file-backed, anything.
>
> You are missing one point here.  The MPI specifications that have
> been out there for decades do not require the process use a library
> for allocating the buffer.  I realize that is a horrible shortcoming,
> but that is the world we live in.  Even if we could change that spec,

Can you change the spec? Are you working on it?


> we would still need to support the existing specs.  As a result, the
> user can change their virtual address space as they need and still expect
> communications be cheap.

That's true. How has it been supported up to now? Are you using
these kinds of notifiers in patched kernels?



Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-25 Thread Nick Piggin
On Wednesday 20 February 2008 09:01, Alexey Dobriyan wrote:
> On Tue, Feb 19, 2008 at 11:47:11PM +0300,  wrote:

> > > Are you reproducing it simply by running the
> > > ftest03 binary directly from the shell? How many times between oopses?
> > > It is multi-process but no threads, so races should be minimal down
> > > this path -- can you get an strace of the failing process?
>
> Speaking of multi-processness, changing MAXCHILD to 1, nchild to 1,
> AFAICS, generates one child which oopses the very same way (in parallel
> with generic LTP). But lowering MAXIOVCNT to 8 generates no oops.

Thanks, I was able to reproduce quite easily with these settings.
I think I have the correct patch now (at least it isn't triggerable
any more here).

Thanks,
Nick
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c74b68..2650073 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1750,14 +1750,18 @@ static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
 	} else {
 		const struct iovec *iov = i->iov;
 		size_t base = i->iov_offset;
+		size_t copied = 0;
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments (without overrunning the iovec).
 		 */
-		while (bytes || !iov->iov_len) {
-			int copy = min(bytes, iov->iov_len - base);
+		while (copied < bytes ||
+		       unlikely(!iov->iov_len && copied < i->count)) {
+			int copy;
 
+			copy = min(bytes, iov->iov_len - base);
+			copied += copy;
 			bytes -= copy;
 			base += copy;
 			if (iov->iov_len == base) {


Re: [PATCH] alloc_percpu() fails to allocate percpu data

2008-02-23 Thread Nick Piggin
On Friday 22 February 2008 09:26, Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
> > Some oprofile results obtained while using tbench on a 2x2 cpu machine
> > were very surprising.
> >
> > For example, loopback_xmit() function was using high number of cpu
> > cycles to perform the statistic updates, supposed to be real cheap
> > since they use percpu data
> >
> > pcpu_lstats = netdev_priv(dev);
> > lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> > lb_stats->packets++;  /* HERE : serious contention */
> > lb_stats->bytes += skb->len;
> >
> >
> > struct pcpu_lstats is a small structure containing two longs. It
> > appears that on my 32bits platform, alloc_percpu(8) allocates a single
> > cache line,  instead of giving to each cpu a separate cache line.
> >
> > Using the following patch gave me impressive boost in various
> > benchmarks ( 6 % in tbench) (all percpu_counters hit this bug too)
> >
> > Long term fix (ie >= 2.6.26) would be to let each CPU allocate their
> own block of memory, so that we don't need to round up sizes to
> > L1_CACHE_BYTES, or merging the SGI stuff of course...
> >
> > Note : SLUB vs SLAB is important here to *show* the improvement, since
> they don't have the same minimum allocation sizes (8 bytes vs 32
> > bytes). This could very well explain regressions some guys reported
> > when they switched to SLUB.
>
> I've complained about this false sharing as well, so until we get the
> new and improved percpu allocators,

What I don't understand is why the slab allocators have something like
this in it:

	if ((flags & SLAB_HWCACHE_ALIGN) &&
			size > cache_line_size() / 2)
		return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't
understand why they think they know better than the caller.
Things like this are just going to lead to very difficult to track
performance problems, and possibly correctness problems in rare cases.

There could be another flag for "maybe align".
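
(For illustration, the kind of guarantee being asked for is what the
explicit annotation gives you today -- a sketch, not Eric's actual
patch:)

#include <linux/cache.h>

/* Each instance occupies its own cache line, so two CPUs'
 * counters can never falsely share one. */
struct pcpu_lstats {
	unsigned long packets;
	unsigned long bytes;
} ____cacheline_aligned_in_smp;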



Re: [PATCH] Document huge memory/cache overhead of memory controller in Kconfig

2008-02-21 Thread Nick Piggin
On Wednesday 20 February 2008 23:52, Balbir Singh wrote:
> Andi Kleen wrote:
> > Document huge memory/cache overhead of memory controller in Kconfig
> >
> > I was a little surprised that 2.6.25-rc* increased struct page for the
> > memory controller.  At least on many x86-64 machines it will not fit into
> > a single cache line now anymore and also costs considerable amounts of
> > RAM.
>
> The size of struct page earlier was 56 bytes on x86_64 and with 64 bytes it
> won't fit into the cacheline anymore? Please also look at
> http://lwn.net/Articles/234974/

BTW, we'll probably want to increase the width of some counters
in struct page at some point for 64-bit, so then it really will
go over a cache line with the memory controller!

Actually, an external data structure is a pretty good idea. We
could probably do it easily with a radix tree (pfn->memory
controller). And that might be a better option for distros.
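
Something like the following is what I have in mind -- only a sketch
with made-up names, keyed by pfn (locking and teardown omitted):

#include <linux/radix-tree.h>

struct mem_cgroup;

static RADIX_TREE(page_cgroup_tree, GFP_ATOMIC);	/* pfn -> mem_cgroup */

static int page_cgroup_record(unsigned long pfn, struct mem_cgroup *cg)
{
	return radix_tree_insert(&page_cgroup_tree, pfn, cg);
}

static struct mem_cgroup *page_cgroup_find(unsigned long pfn)
{
	return radix_tree_lookup(&page_cgroup_tree, pfn);
}

struct page then stays untouched, and distros only pay the lookup cost
when the controller is actually used.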



Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin
On Wed, Feb 20, 2008 at 01:03:24PM +0100, Andrea Arcangeli wrote:
> If there's agreement that the VM should alter its locking from
> spinlock to mutex for its own good, then Christoph's
> one-config-option-fits-all becomes a lot more appealing (replacing RCU
> with a mutex in the mmu notifier list registration locking isn't my
> main worry and the non-sleeping-users may be ok to live with it).

Just from a high-level view, in some cases we can just say that no, we
aren't going to support this. And this may well be one of those cases.

The more constraints placed on the VM, the harder it becomes to
improve and adapt in future. And this seems like a pretty big restriction
(especially if we can eg. work around it completely by having a special
purpose driver to get_user_pages on comm buffers, as I suggested in the
other mail).

At any rate, I believe Andrea's patch really places minimal or no
constraints beyond those of a regular CPU TLB (or the hash tables that some archs
implement). So we're kind of in 2 different leagues here.


Re: [PATCH] mmu notifiers #v6

2008-02-20 Thread Nick Piggin
On Wed, Feb 20, 2008 at 11:39:42AM +0100, Andrea Arcangeli wrote:
> Given Nick's comments I ported my version of the mmu notifiers to
> latest mainline. There are no known bugs AFAIK and it's obviously safe
> (nothing is allowed to schedule inside rcu_read_lock taken by
> mmu_notifier() with my patch).

Thanks! Yes the seqlock you are using now ends up looking similar
to what I did and I couldn't find a hole in that either. So I
think this is going to work.
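
(For anyone following along, the read side of that scheme is roughly
the following -- a sketch with made-up names, not Andrea's actual code.
An update bumps the sequence count, and a racing reader just retries
instead of sleeping:)

#include <linux/seqlock.h>

static seqcount_t notifier_seq = SEQCNT_ZERO;

extern void *notifier_lookup(unsigned long address);	/* hypothetical */

static void *stable_lookup(unsigned long address)
{
	unsigned seq;
	void *entry;

	do {
		seq = read_seqcount_begin(&notifier_seq);
		entry = notifier_lookup(address);
	} while (read_seqcount_retry(&notifier_seq, seq));

	return entry;
}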

I do prefer some parts of my patch; however, for everyone's sanity,
I think you should be the maintainer of the mmu notifiers, and I
will send you incremental changes that can be discussed more easily
that way (nothing major, mainly style and minor things).


> XPMEM simply can't use RCU for the registration locking if it wants to
> schedule inside the mmu notifier calls. So I guess it's better to add
> the XPMEM invalidate_range_end/begin/external-rmap as a whole
> different subsystem that will have to use a mutex (not RCU) to
> serialize, and at the same time that CONFIG_XPMEM will also have to
> switch the i_mmap_lock to a mutex. I doubt xpmem fits inside a
> CONFIG_MMU_NOTIFIER anymore, or we'll all run a bit slower because of
> it. It's really a call of how much we want to optimize the MMU
> notifier, by keeping things like RCU for the registration.

I agree: your coherent, non-sleeping mmu notifiers are pretty simple
and unintrusive. The sleeping version is fundamentally going to either
need to change VM locks, or be non-coherent, so I don't think there is
a question of making one solution fit everybody. So the sleeping /
xrmap patch should be kept either completely independent, or as an
add-on to this one.

I will post some suggestions to you when I get a chance.

 


Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin
On Wed, Feb 20, 2008 at 02:09:41AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
> > Sorry, I realise I still didn't get this through my head yet (and also
> > have not seen your patch recently). So I don't know exactly what you
> > are doing...
> 
> The last version was posted here:
> 
> http://marc.info/?l=kvm-devel&m=120321732521533&w=2
> 
> > But why does _anybody_ (why does Christoph's patches) need to invalidate
> > when they are going to be more permissive? This should be done lazily by
> > the driver, I would have thought.
> 
> This can be done lazily by the driver yes. The place where I've an
> invalidate_pages in mprotect however can also become less permissive.

That's OK, because we have to flush tlbs there too.


> It's simpler to invalidate always and it's not guaranteed the
> secondary mmu page fault is capable of refreshing the spte across a
> writeprotect fault.

I think we just have to make sure that it _can_ do writeprotect
faults. AFAIKS, that will be possible if the driver registers a
.page_mkwrite handler (actually not quite -- page_mkwrite is fairly
crap, so I have a patch to merge it together with .fault so we get
address information as well). Anyway, I really think we should do
it that way.
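
Roughly what I mean, against the current hook (the mydrv_* names are
made up; note that today's page_mkwrite only gets the page, not the
address, which is the limitation mentioned above):

#include <linux/mm.h>

static void mydrv_refresh_spte(struct page *page);	/* hypothetical */
static int mydrv_fault(struct vm_area_struct *vma, struct vm_fault *vmf);

static int mydrv_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
	/* The core VM is about to make a pte writable; upgrade or
	 * refresh the secondary-MMU mapping for this page here. */
	mydrv_refresh_spte(page);
	return 0;
}

static struct vm_operations_struct mydrv_vm_ops = {
	.fault		= mydrv_fault,
	.page_mkwrite	= mydrv_page_mkwrite,
};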

> In the future this can be changed to
> mprotect_pages though, so no page fault will happen in the secondary
> mmu.

Possibly, but hopefully not needed for performance. Let's wait and
see.


Re: [patch] my mmu notifiers

2008-02-20 Thread Nick Piggin
On Tue, Feb 19, 2008 at 05:40:50PM -0600, Jack Steiner wrote:
> On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
> > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > > > anything when changing the pte to be _more_ permissive, and I don't
> > > 
> > > Note that in my patch the invalidate_pages in mprotect can be
> > > trivially switched to a mprotect_pages with proper params. This will
> > > prevent page faults completely in the secondary MMU (there will only
> > > be tlb misses after the tlb flush just like for the core linux pte),
> > > and it'll allow all the secondary MMU pte blocks (512/1024 at time
> > > with my PT lock design) to be updated to have proper permissions
> > > matching the core linux pte.
> > 
> > Sorry, I realise I still didn't get this through my head yet (and also
> > have not seen your patch recently). So I don't know exactly what you
> > are doing...
> > 
> > But why does _anybody_ (why does Christoph's patches) need to invalidate
> > when they are going to be more permissive? This should be done lazily by
> > the driver, I would have thought.
> 
> 
> Agree. Although for most real applications, the performance difference
> is probably negligible.

But importantly, doing it that way means you share test coverage with
the CPU TLB flushing code, and you don't introduce a new concept to the
VM.

So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a
good reason otherwise). mprotect shouldn't really be a special case,
because it still has to flush the CPU tlbs as well when restricting
access.


Re: [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)

2008-02-20 Thread Nick Piggin
On Wednesday 20 February 2008 20:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote:
> > On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> > > For XPMEM, we do not currently allow file backed
> > > mapping pages to be exported, so we should never reach this
> > > condition. It has been an issue since day 1.  We have operated with
> > > that assumption for 6 years and have not had issues with that
> > > assumption.  The user of xpmem is MPT and it controls the communication
> > > buffers so it is reasonable to expect this type of behavior.
> >
> > OK, that makes things simpler.
> >
> > So why can't you export a device from your xpmem driver, which
> > can be mmap()ed to give out "anonymous" memory pages to be used
> > for these communication buffers?
>
> Because we need to have heap and stack available as well.  MPT does
> not control all the communication buffer areas.  I haven't checked, but
> this is the same problem that IB will have.  I believe they are actually
> allowing any memory region be accessible, but I am not sure of that.

Then you should create a driver that the user program can register
and unregister regions of their memory with. The driver can do a
get_user_pages to get the pages, and then you'd just need to set up
some kind of mapping so that userspace can unmap pages / won't leak
memory (and an exit_mm notifier I guess).

Because you don't need to swap, you don't need coherency, and you
are in control of the areas, then this seems like the best choice.
It would allow you to use heap, stack, file-backed, anything.
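
The registration side could be as simple as this sketch (the xpdrv_*
names are made up; unregister, the exit_mm notifier, and partial-pin
unwinding are omitted):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>

static int xpdrv_export(struct page **pages, int nr);	/* hypothetical */

static int xpdrv_register_region(unsigned long start, unsigned long len)
{
	int nr = len >> PAGE_SHIFT;
	struct page **pages;
	int got;

	pages = kmalloc(nr * sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	down_read(&current->mm->mmap_sem);
	got = get_user_pages(current, current->mm, start, nr,
				1, 0, pages, NULL);	/* write, no force */
	up_read(&current->mm->mmap_sem);
	if (got < 0) {
		kfree(pages);
		return got;
	}

	/* The pages are now pinned: export them to the remote side,
	 * remembering them so unregister can put_page() each one. */
	return xpdrv_export(pages, got);
}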



Re: [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin
On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> For XPMEM, we do not currently allow file backed
> mapping pages to be exported, so we should never reach this condition.
> It has been an issue since day 1.  We have operated with that assumption
> for 6 years and have not had issues with that assumption.  The user of
> xpmem is MPT and it controls the communication buffers so it is reasonable
> to expect this type of behavior.

OK, that makes things simpler.

So why can't you export a device from your xpmem driver, which
can be mmap()ed to give out "anonymous" memory pages to be used
for these communication buffers?

I guess you may also want an "munmap/mprotect" callback, which
we don't have in the kernel right now... but at least you could
prototype it easily by having an ioctl to be called before
munmapping or mprotecting (eg. the ioctl could prevent new TLB
setup for the region, and shoot down existing ones).
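
To make that concrete (every name here is hypothetical; it just
illustrates the ordering):

#include <sys/ioctl.h>
#include <sys/mman.h>

struct xpmem_range { unsigned long start, len; };		/* hypothetical */
#define XPMEM_PRE_UNMAP _IOW('x', 1, struct xpmem_range)	/* hypothetical */

static void safe_unmap(int xpmem_fd, void *addr, size_t len)
{
	struct xpmem_range r = { (unsigned long)addr, len };

	/* Block new remote TLB setup and shoot down existing entries... */
	ioctl(xpmem_fd, XPMEM_PRE_UNMAP, &r);
	/* ... then it is safe to tear down the local ptes. */
	munmap(addr, len);
}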

This is actually going to be much faster for you if you use any
threaded applications, because you will be able to do all the
shootdown round trips outside mmap_sem, and so you will be able
to have other threads faulting and even mmap()ing / munmaping
at the same time as the shootdown is happening.

I guess there is some catch...



Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Wednesday 20 February 2008 14:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:

> > > Also, how to you resolve the case where you are not allowed to sleep?
> > > I would have thought either you have to handle it, in which case nobody
> > > needs to sleep; or you can't handle it, in which case the code is
> > > broken.
> >
> > I also asked exactly this, glad you reasked this too.
>
> Currently, we BUG_ON having a PFN in our tables and not being able
> to sleep.  These are mappings which MPT has never supported in the past
> and XPMEM was already not allowing page faults for VMAs which are not
> anonymous so it should never happen.  If the file-backed operations can
> ever get changed to allow for sleeping and a customer has a need for it,
> we would need to change XPMEM to allow those types of faults to succeed.

Do you really want to be able to swap, or are you just interested
in keeping track of unmaps / prot changes?



Re: [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> These special additional callbacks are required because XPmem (and likely
> other mechanisms) do use their own rmap (multiple processes on a series
> of remote Linux instances may be accessing the memory of a process).
> F.e. XPmem may have to send out notifications to remote Linux instances
> and receive confirmation before a page can be freed.
>
> So we handle this like an additional Linux reverse map that is walked after
> the existing rmaps have been walked. We leave the walking to the driver
> that is then able to use something else than a spinlock to walk its reverse
> maps. So we can actually call the driver without holding spinlocks while we
> hold the Pagelock.

I don't know how this is supposed to solve anything. The sleeping
problem happens I guess mostly in truncate. And all you are doing
is putting these rmap callbacks in page_mkclean and try_to_unmap.


> However, we cannot determine the mm_struct that a page belongs to at
> that point. The mm_struct can only be determined from the rmaps by the
> device driver.
>
> We add another pageflag (PageExternalRmap) that is set if a page has
> been remotely mapped (f.e. by a process from another Linux instance).
> We can then only perform the callbacks for pages that are actually in
> remote use.
>
> Rmap notifiers need an extra page bit and are only available
> on 64 bit platforms. This functionality is not available on 32 bit!
>
> A notifier that uses the reverse maps callbacks does not need to provide
> the invalidate_page() method that is called when locks are held.

That doesn't seem right. To start with, the new callbacks aren't
even called in the places where invalidate_page isn't allowed to
sleep.

The problem is unmap_mapping_range, right? And unmap_mapping_range
must walk the rmaps with the mmap lock held, which is why it can't
sleep. And it can't hold any mmap_sem so it cannot prevent address
space modifications of the processes in question between the time
you unmap them from the linux ptes with unmap_mapping_range, and the
time that you unmap them from your driver.

So in the meantime, you could have eg. a fault come in and set up a
new page for one of the processes, and that page might even get
exported via the same external driver. And now you have a totally
inconsistent view.

Preventing new mappings from being set up until the old mapping is
completely flushed is basically what we need to ensure for any sane
TLB as far as I can tell. To do that, you'll need to make the mmap
lock sleep, and either take mmap_sem inside it (which is a
deadlock condition at the moment), or make ptl sleep as well. These
are simply the locks we use to prevent that from happening, so I
can't see how you can possibly hope to have a coherent TLB without
invalidating inside those locks.



Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > anything when changing the pte to be _more_ permissive, and I don't
> 
> Note that in my patch the invalidate_pages in mprotect can be
> trivially switched to a mprotect_pages with proper params. This will
> prevent page faults completely in the secondary MMU (there will only
> be tlb misses after the tlb flush just like for the core linux pte),
> and it'll allow all the secondary MMU pte blocks (512/1024 at time
> with my PT lock design) to be updated to have proper permissions
> matching the core linux pte.

Sorry, I realise I still didn't get this through my head yet (and also
have not seen your patch recently). So I don't know exactly what you
are doing...

But why does _anybody_ (why does Christoph's patches) need to invalidate
when they are going to be more permissive? This should be done lazily by
the driver, I would have thought.


Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.

You can't sleep inside rcu_read_lock()!

I must say that for a patch that is up to v8 or whatever and is
posted twice a week to such a big cc list, it is kind of slack to
not even test it and expect other people to review it.

Also, what we are going to need here are not skeleton drivers
that just do all the *easy* bits (of registering their callbacks),
but actual fully working examples that do everything that any
real driver will need to do. If not for the sanity of the driver
writer, then for the sanity of the VM developers (I don't want
to have to understand xpmem or infiniband in order to understand
how the VM works).



> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.
>
> Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
> Signed-off-by: Robin Holt <[EMAIL PROTECTED]>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> ---
>  mm/filemap_xip.c |5 +
>  mm/fremap.c  |3 +++
>  mm/hugetlb.c |3 +++
>  mm/memory.c  |   35 +--
>  mm/mmap.c|2 ++
>  mm/mprotect.c|3 +++
>  mm/mremap.c  |7 ++-
>  7 files changed, 51 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/mm/fremap.c
> ===
> --- linux-2.6.orig/mm/fremap.c2008-02-14 18:43:31.0 -0800
> +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.0 -0800
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/mmu_notifier.h>
>
>  #include 
>  #include 
> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
>   spin_unlock(&mapping->i_mmap_lock);
>   }
>
> + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
>   err = populate_range(mm, vma, start, size, pgoff);
> + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
>   if (!err && !(flags & MAP_NONBLOCK)) {
>   if (unlikely(has_write_lock)) {
>   downgrade_write(&mm->mmap_sem);
> Index: linux-2.6/mm/memory.c
> ===
> --- linux-2.6.orig/mm/memory.c2008-02-14 18:43:31.0 -0800
> +++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.0 -0800
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/mmu_notifier.h>
>
>  #include 
>  #include 
> @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
>   if (is_vm_hugetlb_page(vma))
>   return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
> +
>   dst_pgd = pgd_offset(dst_mm, addr);
>   src_pgd = pgd_offset(src_mm, addr);
>   do {
> @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
>   vma, addr, next))
>   return -ENOMEM;
>   } while (dst_pgd++, src_pgd++, addr = next, addr != end);
> +
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_end, src_mm,
> + vma->vm_start, end, 0);
> +
>   return 0;
>  }
>
> @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
>   struct mmu_gather *tlb;
>   unsigned long end = address + size;
>   unsigned long nr_accounted = 0;
> + int atomic = details ? (details->i_mmap_lock != 0) : 0;
>
>   lru_add_drain();
>   tlb = tlb_gather_mmu(mm, 0);
>   update_hiwater_rss(mm);
> + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
>   end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
>   if (tlb)
>   tlb_finish_mmu(tlb, address, end);
> + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
>   return end;
>  }
>

Where do you invalidate for munmap()?

Also, how to you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.


Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote:
> > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > > understand the need for invalidate_begin/invalidate_end pairs at all.
> > 
> > The need of the pairs is crystal clear to me: range_begin is needed
> > for GRU _but_only_if_ range_end is called after releasing the
> > reference that the VM holds on the page. _begin will flush the GRU tlb
> > and at the same time it will take a mutex that will block further GRU
> > tlb-miss-interrupts (no idea how they manage that nightmare locking,
> > I didn't even try to add more locking to KVM and I get away with the
> > fact KVM takes the pin on the page itself).
> 
> As it turns out, no actual mutex is required. _begin_ simply increments a
> count of active range invalidates, _end_ decrements the count. New TLB
> dropins are deferred while range callouts are active.
> 
> This would appear to be racy but the GRU has special hardware that
> simplifies locking. When the GRU sees a TLB invalidate, all outstanding
> misses & potentially inflight TLB dropins are marked by the GRU with a
> "kill" bit. When the dropin finally occurs, the dropin is ignored & the
> instruction is simply restarted. The instruction will fault again & the TLB
> dropin will be repeated.  This is optimized for the case where invalidates
> are rare - true for users of the GRU.

OK (thanks to Robin as well). Now I understand why you are using it,
but I don't understand why you don't defer new TLBs after the point
where the linux pte changes. If you can do that, then you look and
act much more like a TLB from the point of view of the Linux vm.
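
(So as I understand it, the scheme Jack describes amounts to something
like the following -- a sketch with made-up names, relying on the
hardware "kill" bit to catch the races:)

#include <asm/atomic.h>

static void gru_flush_tlb_range(unsigned long start, unsigned long end);
					/* hypothetical */

static atomic_t gru_range_active = ATOMIC_INIT(0);

static void gru_range_begin(unsigned long start, unsigned long end)
{
	atomic_inc(&gru_range_active);	/* defer new TLB dropins */
	gru_flush_tlb_range(start, end);
}

static void gru_range_end(void)
{
	atomic_dec(&gru_range_active);
}

static int gru_dropin_allowed(void)
{
	/* New dropins wait (or get killed and restarted by the
	 * hardware) while any range invalidate is in flight. */
	return atomic_read(&gru_range_active) == 0;
}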




Re: [patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > are rather similar. However I have tried to make a point of minimising the
> > impact to the core mm/. I don't see why we need to invalidate or flush
> 
> I also tried hard to minimise the impact of the core mm/, I also
> argued with Christoph that cluttering mm/ wasn't a good idea for
> things like age_page that could be a 1 liner change instead of a
> multiple-liner change, without any loss of flexibility or readability.
> 
> > anything when changing the pte to be _more_ permissive, and I don't
> 
> Note that in my patch the invalidate_pages in mprotect can be
> trivially switched to a mprotect_pages with proper params. This will
> prevent page faults completely in the secondary MMU (there will only
> be tlb misses after the tlb flush just like for the core linux pte),
> and it'll allow all the secondary MMU pte blocks (512/1024 at time
> with my PT lock design) to be updated to have proper permissions
> matching the core linux pte.
> 
> > understand the need for invalidate_begin/invalidate_end pairs at all.
> 
> The need of the pairs is crystal clear to me: range_begin is needed
> for GRU _but_only_if_ range_end is called after releasing the
> reference that the VM holds on the page. _begin will flush the GRU tlb
> and at the same time it will take a mutex that will block further GRU
> tlb-miss-interrupts (no idea how they manage that nightmare locking,
> I didn't even try to add more locking to KVM and I get away with the
> fact KVM takes the pin on the page itself).
> 
> My patch calls invalidate_page/pages before the reference is released
> on the page, so GRU will work fine despite lack of
> range_begin. Furthermore with my patch GRU will be auto-serialized by
> the PT lock w/o the need of any additional locking.

That's why I don't understand the need for the pairs: it should be
done like this.


> > What I have done is basically create it so that the notifiers get called
> > in the same place as the normal TLB flushing is done, and nowhere
> > else.
> 
> That was one of my objectives too.
> 
> > I also wanted to avoid calling notifier code from inside eg. hardware TLB
> > or pte manipulation primitives. These things are already pretty well
> > spaghetti, so I'd like to just place them right where needed first... I
> > think eventually it will need a bit of a rethink to make it more consistent
> > and more general. But I prefer to put them in the caller for the moment.
> 
> Your patch should also work for KVM but it's suboptimal, my patch can
> be orders of magnitude more efficient for GRU thanks to the
> invalidate_pages optimization. Christoph complained about having to
> call one method per pte.

OK, I didn't see the invalidate_pages call...

 
> And adding invalidate_range is useless unless you fully support
> xpmem. You're calling invalidate_range in places that can't sleep...

I thought that could be used by a non-sleeping user (not intending
to try supporting sleeping users). If it is useless then it should
go away (BTW, I didn't see your recent patch; some of my confusion
I think stems from Christoph's novel way of merging and splitting
patches).


> No idea why xpmem needs range_begin, I perfectly understand why GRU
> needs _begin with Christoph's patch (gru lacks the page pin) but I
> dunno why xpmem needs range_begin (xpmem has the page pin so I also
> think it could avoid using range_begin). Still to support GRU you need
> both to call invalidate_range in places that can sleep and you need
> the external rmap notifier. The moment you add xpmem into the equation
> your and my clean patches become Christoph's one...

Sorry, I kind of didn't have time to follow the conversation so well
before; are there patches posted for gru and/or xpmem?



Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin
On Tuesday 19 February 2008 20:57, Andi Kleen wrote:
> On Tue, Feb 19, 2008 at 08:46:46PM +1100, Nick Piggin wrote:

> > I think it was just a simple context switch benchmark, but not lmbench
> > (which I found to be a bit too variable). But it was a long time ago...
>
> Do you still have it?
>
> I thought about writing my own but ended up being too lazy for that @)

Had a quick look but couldn't find it. It was just two threads running
and switching to each other with a couple of mutexes or yield. If I
find it, then I'll send it over.


> > > > Actually one thing I don't like about gcc is that I think it still
> > > > emits cmovs for likely/unlikely branches,
> > >
> > > That's -Os.
> >
> > And -O2 and -O3, on the gccs that I'm using, AFAIKS.
>
> Well if it still happens on gcc 4.2 with P4 tuning you should
> perhaps open a gcc PR. They tend to ignore these bugs mostly in
> my experience, but sometimes they act on them.

I'm not sure about P4 tuning... but IMO it should not use cmov on
predictable branches for any (especially OOOE) CPU.


> > > > which is silly (the gcc developers
> > >
> > > It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
> > > makes sense. P4 doesn't like it though.
> >
> > If the branch is completely predictable (eg. annotated), then I
> > think branches should be used anyway. Even on well predicted
> > branches, cmov is similar speed on microbenchmarks, but it will
> > increase data hazards I think, so it will probably be worse for
> > some real world situations.
>
> At least the respective optimization manuals say they should be used.
> I presume they only made this recommendation after some extensive
> benchmarking.

What I have seen is that they tell you definitely not to use it for
predictable branches. Eg. the Intel optimization manual says

 Use the setcc and cmov instructions to eliminate unpredictable
 conditional branches where possible. Do not do this for predictable
 branches. Do not use these instructions to eliminate all
 unpredictable conditional branches, because using these instructions
 will incur execution overhead due to executing both paths of a
 conditional branch. In addition, converting conditional branches to
 cmovs or setcc trades control-flow dependence for data dependence
 and restricts the capability of the out-of-order engine.


> > But a likely branch will be _strongly_ predicted to be taken,
> > whereas a lot of the gcc heuristics simply have slightly more or
> > slightly less probability. So it's not just a question of which
> > way is more likely, but also _how_ likely it is to go that way.
>
> Yes, but a lot of the heuristics are pretty strong (>80%) and gcc will
> act on them unless it has a very strong contra cue. And that should
> normally not be the case.

True, but if you know a branch is 99%+, then use of likely/unlikely
can still be a good idea. 80% may not be enough to choose a branch
over a cmov for example.
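
(For reference, the kernel's annotations boil down to __builtin_expect;
a branch like the one below -- handle_error/do_common_case are made up
-- is the ~99% case I mean:)

/* As in the kernel's <linux/compiler.h>: */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

extern int handle_error(void);
extern int do_common_case(void);

int fastpath(int rare_error)
{
	/* Almost never taken: we want a real branch laid out off the
	 * hot path, not a cmov with an extra data dependency. */
	if (unlikely(rare_error))
		return handle_error();
	return do_common_case();
}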



Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-19 Thread Nick Piggin
On Tuesday 19 February 2008 20:25, Andi Kleen wrote:
> On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:

> > I actually once measured context switching performance in the scheduler,
> > and removing the  unlikely hint for testing RT tasks IIRC gave about 5%
> > performance drop.
>
> OT: what benchmarks did you use for that? I had a change some time
> ago to the CFS scheduler to avoid unpredicted indirect calls for
> the common case, but I wasn't able to benchmark a difference with the usual
> suspect benchmark (lmbench). Since it increased code size by
> a few bytes it was rejected then.

I think it was just a simple context switch benchmark, but not lmbench
(which I found to be a bit too variable). But it was a long time ago...


> > This was on a P4 which is very different from more modern CPUs both in
> > terms of branch performance characteristics,
> >
> > and icache characteristics.
>
> Hmm, the P4 trace cache actually should not care about inline
> code that is not executed.

Yeah, which is why it is a bit different from other CPUs. The
L2 cache, I guess, is still going to suffer from sparse code, but
that is a bit less important.


> > However, the P4's branch predictor is pretty good, and it should easily
>
> I think it depends on the generation. Prescott class branch
> prediction should be much better than the earlier ones.

I was using a Nocona Xeon, which I think is Prescott class? And
don't they have a much higher mispredict penalty than older P4s?


> > Actually one thing I don't like about gcc is that I think it still emits
> > cmovs for likely/unlikely branches,
>
> That's -Os.

And -O2 and -O3, on the gccs that I'm using, AFAIKS.


> > which is silly (the gcc developers
>
> It depends on the CPU. e.g. on K8 and P6 using CMOV if possible
> makes sense. P4 doesn't like it though.

If the branch is completely predictable (eg. annotated), then I
think branches should be used anyway. Even on well predicted
branches, cmov is similar speed on microbenchmarks, but it will
increase data hazards I think, so it will probably be worse for
some real world situations.


> > the quite good numbers that cold CPU predictors can attain. However
> > for really performance critical code (or really "never" executed
> > code), then I think it is OK to have the hints and not have to rely
> > on gcc heuristics.
>
> But only when the explicit hints are different from what the implicit
> branch predictors would predict anyways. And if you look at the
> heuristics that is not often the case...

But a likely branch will be _strongly_ predicted to be taken,
whereas a lot of the gcc heuristics simply have slightly more or
slightly less probability. So it's not just a question of which
way is more likely, but also _how_ likely it is to go that way.



Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-19 Thread Nick Piggin
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.
>
> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.

This whole thing would be much better if you didn't rely on the page
lock at all, but either a) used the same locking as Linux does for its
ptes/tlbs, or b) have some locking that is private to the mmu notifier
code. Then there is not all this new stuff that has to be understood in
the core VM.

Also, why do you have to "invalidate" ranges when switching to a
_more_ permissive state? This stuff should basically be the same as
(a subset of) the TLB flushing API AFAIKS. Anything more is a pretty
big burden to put in the core VM.

See my alternative patch I posted -- I can't see why it won't work
just like a TLB.

As far as sleeping inside callbacks goes... I think there are big
problems with the patch (the sleeping patch and the external rmap
patch). I don't think it is workable in its current state. Either
we have to make some big changes to the core VM, or we have to turn
some locks into sleeping locks to do it properly AFAIKS. Neither
one is good.

But anyway, I don't really think the two approaches (Andrea's
notifiers vs sleeping/xrmap) should be tangled up too much. I
think Andrea's can possibly be quite unintrusive and useful very
soon.



Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-19 Thread Nick Piggin
On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:

> > >   flush_cache_page(vma, address, pte_pfn(*pte));
> > >   entry = ptep_clear_flush(vma, address, pte);
> > > + mmu_notifier(invalidate_page, mm, address);
> >
> > I just don't see how ths can be done if the callee has another thread in
> > the middle of establishing IO against this region of memory.
> > ->invalidate_page() _has_ to be able to block.  Confused.
>
> The page lock is held and that holds off I/O?

I think the actual answer is that "it doesn't matter".

ptes are not exactly the entity via which IO gets established, so
all we really care about here is that after the callback finishes,
we will not get any more reads or writes to the page via the
external mapping.

As far as holding off local IO goes, that is the job of the core
VM. (And no, page lock does not necessarily hold it off FYI -- it
can be writeback IO or even IO directly via buffers).

Holding off IO via the external references I guess is a job for
the notifier driver.



[patch] my mmu notifier sample driver

2008-02-19 Thread Nick Piggin

Index: linux-2.6/drivers/char/mmu_notifier_skel.c
===
--- /dev/null
+++ linux-2.6/drivers/char/mmu_notifier_skel.c
@@ -0,0 +1,255 @@
+#include <linux/kernel.h>	/* NB: include list reconstructed, best guess */
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/pagemap.h>
+#include <linux/spinlock.h>
+#include <linux/seqlock.h>
+#include <linux/radix-tree.h>
+#include <linux/mmu_notifier.h>
+#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
+
+static DEFINE_SPINLOCK(mmn_lock);
+static RADIX_TREE(rmap_tree, GFP_ATOMIC);
+static seqcount_t rmap_seq = SEQCNT_ZERO;
+
+static int __rmap_add(unsigned long mem, unsigned long vaddr)
+{
+   int err;
+
+   err = radix_tree_insert(&rmap_tree, mem >> PAGE_SHIFT, (void *)vaddr);
+
+   return err;
+}
+
+static void __rmap_del(unsigned long mem)
+{
+   void *ret;
+
+   ret = radix_tree_delete(&rmap_tree, mem >> PAGE_SHIFT);
+   BUG_ON(!ret);
+}
+
+static unsigned long rmap_find(unsigned long mem)
+{
+   unsigned long vaddr;
+
+   rcu_read_lock();
+   vaddr = (unsigned long)radix_tree_lookup(&rmap_tree, mem >> PAGE_SHIFT);
+   rcu_read_unlock();
+
+   return vaddr;
+}
+
+static struct page *follow_page_atomic(struct mm_struct *mm,
+   unsigned long address, int write)
+{
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, address);
+   if (!vma || (vma->vm_start > address))
+   return NULL;
+
+   if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+   return NULL;
+
+   return follow_page(vma, address, FOLL_GET|(write ? FOLL_WRITE : 0));
+}
+
+static int mmn_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long source_vaddr = (unsigned long)vmf->pgoff << PAGE_SHIFT;
+   unsigned long dest_vaddr = (unsigned long)vmf->virtual_address;
+   unsigned long pfn;
+   struct page *page;
+   pgprot_t prot;
+   int write = vmf->flags & FAULT_FLAG_WRITE;
+   int ret;
+
+   printk("mmn_vm_fault %s @%lx sourcing from %lx\n",
+   write ? "write" : "read", dest_vaddr, source_vaddr);
+
+   BUG_ON(mm != current->mm); /* disallow get_user_pages */
+
+again:
+   spin_lock(&mmn_lock);
+   write_seqcount_begin(&rmap_seq);
+   page = follow_page_atomic(mm, source_vaddr, write);
+   if (unlikely(!page)) {
+   write_seqcount_end(&rmap_seq);
+   spin_unlock(&mmn_lock);
+   ret = get_user_pages(current, mm, source_vaddr,
+   1, write, 0, &page, NULL);
+   if (ret != 1)
+   goto out_err;
+   put_page(page);
+   goto again;
+   }
+
+   ret = __rmap_add(source_vaddr, dest_vaddr);
+   if (ret)
+   goto out_lock;
+
+   pfn = page_to_pfn(page);
+   prot = vma->vm_page_prot;
+   if (!write)
+   vma->vm_page_prot = vm_get_page_prot(vma->vm_flags &
+   ~(VM_WRITE|VM_MAYWRITE));
+   ret = vm_insert_pfn(vma, dest_vaddr, pfn);
+   vma->vm_page_prot = prot;
+   if (ret) {
+   if (ret == -EBUSY)
+   WARN_ON(1);
+   goto out_rmap;
+   }
+   write_seqcount_end(&rmap_seq);
+   spin_unlock(&mmn_lock);
+   put_page(page);
+
+   return VM_FAULT_NOPAGE;
+
+out_rmap:
+   __rmap_del(source_vaddr);
+out_lock:
+   write_seqcount_end(&rmap_seq);
+   spin_unlock(&mmn_lock);
+   put_page(page);
+out_err:
+   switch (ret) {
+   case -EFAULT:
+   case -EEXIST:
+   case -EBUSY:
+   return VM_FAULT_SIGBUS;
+   case -ENOMEM:
+   return VM_FAULT_OOM;
+   default:
+   BUG();
+   }
+}
+
+struct vm_operations_struct mmn_vm_ops = {
+   .fault = mmn_vm_fault,
+};
+
+static int mmu_notifier_busy;
+static struct mmu_notifier mmu_notifier;
+
+static int mmn_clear_young(struct mmu_notifier *mn, unsigned long address)
+{
+   unsigned long vaddr;
+   unsigned seq;
+   struct mm_struct *mm = mn->mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep, pte;
+
+   do {
+   seq = read_seqcount_begin(&rmap_seq);
+   vaddr = rmap_find(address);
+   } while (read_seqcount_retry(&rmap_seq, seq));
+
+   if (vaddr == 0)
+   return 0;
+
+   printk("mmn_clear_young @%lx sourced from %lx\n", vaddr, address);
+
+   spin_lock(&mmn_lock);
+   pgd = pgd_offset(mm, vaddr);
+   pud = pud_offset(pgd, vaddr);
+   if (pud) {
+   pmd = pmd_offset(pud, vaddr);
+   if (pmd) {
+   ptep = pte_offset_map(pmd, vaddr);
+   if (ptep) {
+   pte = *ptep;
+   if (!pte_present(pte)) {
+   /* x86 specific, don't have a vma */
+   ptep_get_and_clear(mm, vaddr, ptep);
+   __flush_tlb_one(vaddr);
+   

[patch] my mmu notifiers

2008-02-19 Thread Nick Piggin
Well I started reviewing the mmu notifier code, but it is kind of hard to
know what you're talking about just by reading through code and not trying
your suggestions for yourself...

So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers
are rather similar. However I have tried to make a point of minimising the
impact to the core mm/. I don't see why we need to invalidate or flush
anything when changing the pte to be _more_ permissive, and I don't
understand the need for invalidate_begin/invalidate_end pairs at all.
What I have done is basically create it so that the notifiers get called
in the same place as the normal TLB flushing is done, and nowhere
else.

I also wanted to avoid calling notifier code from inside eg. hardware TLB
or pte manipulation primitives. These things are already pretty well
spaghetti, so I'd like to just place the calls right where needed first... I
think eventually it will need a bit of a rethink to make it more consistent
and more general. But I prefer to put them in the caller for the moment.
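
To illustrate the pattern (a sketch of the intent, not a quote from the
patch below -- only mmu_notifier_unmap() is from the header it adds):

	/* in an unmap/rmap path, next to the existing flush: */
	flush_cache_page(vma, address, pte_pfn(*pte));
	pteval = ptep_clear_flush(vma, address, pte);
	mmu_notifier_unmap(mm, address);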

I have also attempted to write a skeleton driver. Not like Christoph's
drivers, but one that actually does something. This one can mmap a
window into its own virtual address space. It's not perfect yet (I need
to replace page_mkwrite with ->fault in the core mm before I can get
enough information to do protection properly I think). However I think it
may be race-free in the fault vs unmap paths. It's pretty complex, I must
say.

---

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -228,6 +228,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===
--- /dev/null
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -0,0 +1,69 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_operations;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier {
+   struct hlist_node hlist;
+   const struct mmu_notifier_operations *ops;
+   struct mm_struct *mm;
+};
+
+struct mmu_notifier_operations {
+   void (*release)(struct mmu_notifier *mn);
+   int (*clear_young)(struct mmu_notifier *mn, unsigned long address);
+   void (*unmap)(struct mmu_notifier *mn, unsigned long address);
+   void (*invalidate_range)(struct mmu_notifier *mn,
+  unsigned long start, unsigned long end);
+};
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+   INIT_HLIST_HEAD(&mm->mmu_notifier_list);
+}
+
+static inline void mmu_notifier_init(struct mmu_notifier *mn,
+  const struct mmu_notifier_operations *ops, struct mm_struct *mm)
+{
+   INIT_HLIST_NODE(&mn->hlist);
+   mn->ops = ops;
+   mn->mm = mm;
+}
+
+extern void mmu_notifier_register(struct mmu_notifier *mn);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn);
+
+extern void mmu_notifier_exit_mm(struct mm_struct *mm);
+extern int mmu_notifier_clear_young(struct mm_struct *mm,
+  unsigned long address);
+extern void mmu_notifier_unmap(struct mm_struct *mm, unsigned long address);
+extern void mmu_notifier_invalidate_range(struct mm_struct *mm,
+  unsigned long start, unsigned long end);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void mmu_notifier_exit_mm(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+  unsigned long address)
+{
+   return 0;
+}
+
+static inline void mmu_notifier_unmap(struct mm_struct *mm,
+  unsigned long address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+  unsigned long start, unsigned long end)
+{
+}
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif
Index: linux-2.6/kernel/fork.c
===
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include <linux/mmu_notifier.h>
 #include 
 #include 
 #include 
@@ -358,6 +359,7 @@ static struct mm_struct * mm_init(struct
mm->ioctx_list = NULL;
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+   mmu_notifier_init_mm(mm);
mm_init_cgroup(mm, p);
 
if (likely(!mm_alloc_pgd(mm))) {
Index: linux-2.6/mm/filemap_xip.c
===
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -195,6 +195,7 @@ __xip_unmap (struct address_space * mapp
/* Nuke the page table entry. */
	flush_cache_page(vma, address, pte_pfn(*pte));

Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-19 Thread Nick Piggin
On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
> background
> 
> the current VM implementation doesn't have a limit on the # of parallel
> reclaims. under heavy workload, this brings 2 bad things
>   - heavy lock contention
>   - unnecessary swap out
>
> about 2 months ago, KAMEZAWA Hiroyuki proposed a patch for page
> reclaim throttling and explained that it improves reclaim time.
>   http://marc.info/?l=linux-mm&m=119667465917215&w=2
>
> but unfortunately it works only for memcgroup reclaim.
> Today, I implemented it again to support global reclaim and measured it.
>
>
> test machine, method and result
> ==
> 
>   CPU:  IA64 x8
>   MEM:  8GB
>   SWAP: 2GB
>
> 
>   got hackbench from
>   http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
>
>   $ /usr/bin/time hackbench 120 process 1000
>
>   this parameter means: consume all physical memory and
>   1GB of swap space on my test environment.
>
> 
>
> before:
>   hackbench result:   282.30
>   /usr/bin/time result
>   user:   14.16
>   sys:1248.47
>   elapse: 432.93
>   major fault:29026
>   max parallel reclaim tasks: 1298
>   max consumption time of
>try_to_free_pages():   70394
>
> after:
>   hackbench result:   30.36
>   /usr/bin/time result
>   user:   14.26
>   sys:294.44
>   elapse: 118.01
>   major fault:3064
>   max parallel reclaim tasks: 4
>   max consumption time of
>try_to_free_pages():   12234
>
>
> conclusion
> =
> this patch improves 4 things.
> 1. reduce unnecessary swap
>(see above major fault. about 90% reduced)
> 2. improve throughput performance
>(see above hackbench result. about 90% reduced)
> 3. improve interactive performance.
>(see above max consumption of try_to_free_pages.
> about 80% reduced)
> 4. reduce lock contention.
>(see above sys time. about 80% reduced)
>
>
> Now, we got about 1000% performance improvement of hackbench :)
>
>
>
> future works
> ==
>  - more discussion with memory controller guys.

Hi,

Yeah this is definitely needed and a nice result.

I'm worried about a) placing a global limit on parallelism, and b)
placing a limit on parallelism at all.

I think it should maybe be a per-zone thing...

What happens if you make it a per-zone mutex, and allow just a single
process to reclaim pages from a given zone at a time? I guess that is
going to slow down throughput a little bit in some cases though...
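
Something like the following is what I have in mind (an untested sketch
only -- zone->reclaim_mutex is a hypothetical new field, and where
exactly to take it is an open question):

	/* allow a single reclaimer into a given zone at a time */
	static unsigned long shrink_zone_serialised(int priority,
			struct zone *zone, struct scan_control *sc)
	{
		unsigned long nr;

		mutex_lock(&zone->reclaim_mutex);
		nr = shrink_zone(priority, zone, sc);
		mutex_unlock(&zone->reclaim_mutex);
		return nr;
	}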

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH, RFC] kthread: (possibly) a missing memory barrier in kthread_stop()

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 10:03, Dmitry Adamushko wrote:
> Hi,
>
>
> [ description ]
>
> Subject: kthread: add a memory barrier to kthread_stop()
>
> 'kthread' threads do a check in the following order:
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> and set_current_state() implies an smp_mb().
>
> on another side (kthread_stop), wake_up_process() does not seem to
> guarantee a full mb.
>
> And 'kthread_stop_info.k' must be visible before wake_up_process()
> checks for/modifies a state of the 'kthread' task.
>
> (the patch is at the end of the message)
>
>
> [ more detailed description ]
>
> the current code might well be safe in case a to-be-stopped 'kthread'
> task is _not_ running on another CPU at the moment when kthread_stop()
> is called (in this case, 'rq->lock' will act as a kind of synch.
> point/barrier).
>
> Another case is as follows:
>
> CPU#0:
>
> ...
> while (!kthread_should_stop()) {
>
>if (condition)
>  schedule();
>
>/* ... do something useful ... */   <--- EIP
>
>set_current_state(TASK_INTERRUPTIBLE);
> }
>
> so a 'kthread' task is about to call
> set_current_state(TASK_INTERRUPTIBLE) ...
>
>
> (in the mean time)
>
> CPU#1:
>
> kthread_stop()
>
> -> kthread_stop_info.k = k (*)
> -> wake_up_process()
>
> wake_up_process() looks like:
>
> (try_to_wake_up)
>
> IRQ_OFF
> LOCK
>
> old_state = p->state;
> if (!(old_state & state))  (**)
>  goto out;
>
> ...
>
> UNLOCK
> IRQ_ON
>
>
> let's suppose (*) and (**) are reordered
> (according to Documentation/memory-barriers.txt, neither IRQ_OFF nor
> LOCK may prevent it from happening).
>
> - the state is TASK_RUNNING, so we are about to return.
>
> - CPU#1 is about to execute (*) (it's guaranteed to be done before
> spin_unlock(&rq->lock) at the end of try_to_wake_up())
>
>
> (in the mean time)
>
> CPU#0:
>
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> here, kthread_stop_info.k is not yet visible
>
> - schedule()
>
> ...
>
> we missed a 'kthread_stop' event.
>
> hum?

Looks like you are correct to me.


> TIA,
>
> ---
>
> From: Dmitry Adamushko <[EMAIL PROTECTED]>
> Subject: kthread: add a memory barrier to kthread_stop()
>
> 'kthread' threads do a check in the following order:
> - set_current_state(TASK_INTERRUPTIBLE);
> - kthread_should_stop();
>
> and set_current_state() implies an smp_mb().
>
> on another side (kthread_stop), wake_up_process() is not guaranteed to
> act as a full mb.
>
> 'kthread_stop_info.k' must be visible before wake_up_process() checks
> for/modifies a state of the 'kthread' task.
>
>
> Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>
>
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 0ac8878..5167110 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -211,6 +211,10 @@ int kthread_stop(struct task_struct *k)
>
>   /* Now set kthread_should_stop() to true, and wake it up. */
>   kthread_stop_info.k = k;
> +
> + /* The previous store operation must not get ahead of the wakeup. */
> + smp_mb();
> +
>   wake_up_process(k);
>   put_task_struct(k);
>
>
>
> --

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 16:58, Willy Tarreau wrote:
> On Tue, Feb 19, 2008 at 01:33:53PM +1100, Nick Piggin wrote:
> > > Note in particular the last predictors; assuming branch ending
> > > with goto, including call, causing early function return or
> > > returning negative constant are not taken. Just these alone
> > > are likely 95+% of the unlikelies in the kernel.
> >
> > Yes, gcc should be able to do pretty good heuristics, considering
> > the quite good numbers that cold CPU predictors can attain. However
> > for really performance critical code (or really "never" executed
> > code), then I think it is OK to have the hints and not have to rely
> > on gcc heuristics.
>
> in my experience, the real problem is that gcc does what *it* wants and not
> what *you* want. I've been annoyed a lot by the way it coded some loops
> that could really be blazingly fast, but which resulted in a ton of
> branches due to its predictors. And using unlikely() there was a real mess,
> because instead of just hinting the compiler with probabilities to write
> some linear code for the *most* common case, it ended up with awful
> branches everywhere with code sent far away and even duplicated for some
> branches.
>
> Sometimes, for performance critical paths, I would like gcc to be dumb and
> follow *my* code and not its hard-coded probabilities. For instance, in a
> tree traversal, you really know how you want to build your loop. And these
> days, it seems like the single method of getting it your way is doing asm,
> which obviously is not portable :-(

Probably all true.


> Maybe one thing we would need would be the ability to assign probabilities
> to each branch based on what we expect, so that gcc could build a better
> tree keeping most frequently used code tight.

I don't know if that would *directly* lead to gcc being smarter. I
suspect they don't benchmark on code bases that have much explicit
annotation (I'm sure they wouldn't seriously benchmark any parts of
Linux as part of daily development). I think the key is to continue to
use annotations _properly_, and eventually gcc should go in the right
direction if enough code uses it.

And if you have really good examples like it sounds like above, then
I guess that should be reported to gcc?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 13:40, Arjan van de Ven wrote:
> On Tue, 19 Feb 2008 13:33:53 +1100
>
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Actually one thing I don't like about gcc is that I think it still
> > emits cmovs for likely/unlikely branches, which is silly (the gcc
> > developers seem to be in love with that instruction). If that goes
> > away, then branch hints may be even better.
>
> only for -Os and only if the result is smaller afaik.

What is your evidence for saying this? Because here, with the latest
kernel and recent gcc-4.3 snapshot, it spits out cmov like crazy even
when compiled with -O2.

[EMAIL PROTECTED]:~/usr/src/linux-2.6$ grep cmov kernel/sched.s | wc -l
45

And yes it even does for hinted branches and even at -O2/3

[EMAIL PROTECTED]:~/tests$ cat cmov.c
int test(int a, int b)
{
if (__builtin_expect(a < b, 0))
return a;
else
return b;
}
[EMAIL PROTECTED]:~/tests$ gcc-4.3 -S -O2 cmov.c
[EMAIL PROTECTED]:~/tests$ head -13 cmov.s
	.file	"cmov.c"
	.text
	.p2align 4,,15
.globl test
	.type	test, @function
test:
.LFB2:
	cmpl	%edi, %esi
	cmovle	%esi, %edi
	movl	%edi, %eax
	ret
.LFE2:
	.size	test, .-test

This definitely should be a branch, IMO.

> (cmov tends to be a performance loss most of the time so for -O2 and such
> it isn't used as far as I know.. it does make for nice small code however
> ;-)

It shouldn't be hard to work out the cutover point based on how
expensive cmov is, how expensive branch and branch mispredicts are,
and how often the branch is likely to be mispredicted. For an
unpredictable branch, cmov is normally quite a good win even on
modern CPUs. But gcc overuses it I think.
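
(Back of the envelope, purely as an illustration: if a mispredict costs
~20 cycles and a predicted branch ~1 cycle, a branch mispredicted with
probability p averages about 1 + 20p cycles; against a cmov costing,
say, ~3 cycles of extra data dependency, the branch only loses once p
climbs above roughly 10%. A branch hinted unlikely should be well below
that.)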

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-18 Thread Nick Piggin
On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:

> > > [ 4057.31] Pid: 7035, comm: ftest03 Not tainted
> > > 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #2 [ 4057.31] RIP:
> > > 0010:[]  []
> > > iov_iter_advance+0x38/0x70 [ 4057.31] RSP: 0018:810110329b20 
> > > EFLAGS: 00010246
> > > [ 4057.31] RAX:  RBX: 0800 RCX:
> > >  [ 4057.31] RDX:  RSI:
> > > 0800 RDI: 810110329ba8 [ 4057.31] RBP:
> > > 0800 R08:  R09: 810101dbc000 [
> > > 4057.31] R10: 0004 R11:  R12:
> > > 00026000 [ 4057.31] R13: 81010d765c98 R14:
> > > 1000 R15:  [ 4057.31] FS: 
> > > 7fee589146d0() GS:80501000() knlGS:
> > > [ 4057.31] CS:  0010 DS:  ES:  CR0: 8005003b [
> > > 4057.31] CR2: 810101dbc008 CR3: 0001103da000 CR4:
> > > 06e0 [ 4057.31] DR0:  DR1:
> > >  DR2:  [ 4057.31] DR3:
> > >  DR6: 0ff0 DR7: 0400 [
> > > 4057.31] Process ftest03 (pid: 7035, threadinfo 810110328000,
> > > task 810160b0) [ 4057.31] Stack:  8025b413
> > > 81010d765ab0 804e6fd8 001201d2 [ 4057.31] 
> > > 810110329db8 00026000 810110329d38 81017b9fb500 [
> > > 4057.31]  81010d765c98 804175e0 81010d765ab0
> > >  [ 4057.31] Call Trace:
> > > [ 4057.31]  [] ?
> > > generic_file_buffered_write+0x1e3/0x6f0 [ 4057.31] 
> > > [] ? current_fs_time+0x1e/0x30 [ 4057.31] 
> > > [] ? __generic_file_aio_write_nolock+0x28f/0x440 [
> > > 4057.31]  [] ? generic_file_aio_write+0x63/0xd0 [
> > > 4057.31]  [] ? ext3_file_write+0x23/0xc0 [
> > > 4057.31]  [] ? ext3_file_write+0x0/0xc0 [
> > > 4057.31]  [] ? do_sync_readv_writev+0xcb/0x110 [
> > > 4057.31]  [] ? autoremove_wake_function+0x0/0x30
> > > [ 4057.31]  [] ?
> > > debug_check_no_locks_freed+0x7d/0x130 [ 4057.31] 
> > > [] ? trace_hardirqs_on+0xcf/0x150 [ 4057.31] 
> > > [] ? __kmalloc+0x15/0xc0
> > > [ 4057.31]  [] ? rw_copy_check_uvector+0x9d/0x130
> > > [ 4057.31]  [] ? do_readv_writev+0xe0/0x170
> > > [ 4057.31]  [] ? mutex_lock_nested+0x1a7/0x280
> > > [ 4057.31]  [] ? trace_hardirqs_on+0xcf/0x150
> > > [ 4057.31]  [] ?
> > > __mutex_unlock_slowpath+0xc9/0x170 [ 4057.31]  []
> > > ? trace_hardirqs_on+0xcf/0x150 [ 4057.31]  [] ?
> > > trace_hardirqs_on_thunk+0x35/0x3a [ 4057.31]  []
> > > ? sys_writev+0x53/0x90
> > > [ 4057.31]  [] ?
> > > system_call_after_swapgs+0x7b/0x80 [ 4057.31]
> > > [ 4057.31]
> > > [ 4057.31] Code: 48 01 77 10 48 29 77 18 c3 0f 0b eb fe 66 66 90 66
> > > 66 90 4c 8b 0f 48 8b 4f 10 49 89 f0 eb 07 66 66 66 90 49 29 c0 4d 85 c0
> > > 75 07 <49> 83 79 08 00 75 23 49 8b 51 08 48 89 d0 48 29 c8 49 39 c0 49
> > > [ 4057.31] RIP  [] iov_iter_advance+0x38/0x70 [
> > > 4057.31]  RSP 
> > > [ 4057.31] CR2: 810101dbc008
> > > [ 4057.31] Kernel panic - not syncing: Fatal exception

Can you try this patch please?
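
(My reading of the failure, for the record: if the iovec ends with a
zero-length segment, then once everything has been copied -- i->count ==
bytes -- the old loop condition keeps "skipping" zero-length segments
and walks off the end of the iovec array. The extra i->count > bytes
check stops the walk at the end.)
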
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1753,9 +1753,10 @@ static void __iov_iter_advance_iov(struc
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments. But we mustn't try to "skip" if
+		 * we have come to the end (i->count == bytes).
 		 */
-		while (bytes || !iov->iov_len) {
+		while (bytes || (unlikely(!iov->iov_len) && i->count > bytes)) {
 			int copy = min(bytes, iov->iov_len - base);
 
 			bytes -= copy;


Re: [PATCH 1/3] Fix Unlikely(x) == y

2008-02-18 Thread Nick Piggin
On Tuesday 19 February 2008 01:39, Andi Kleen wrote:
> Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > you have more faith in the authors knowledge of how his code actually
> > behaves than I think is warranted  :)
>
> iirc there was a mm patch some time ago to keep track of the actual
> unlikely values at runtime and it showed indeed some wrong ones. But the
> far majority of them are probably correct.
>
> > Or faith in that he knows what "unlikely" means.
> > I should write docs about this; but unlikely() means:
> > 1) It happens less than 0.01% of the cases.
> > 2) The compiler couldn't have figured this out by itself
> >(NULL pointer checks are compiler done already, same for some other
> > conditions) 3) It's a hot codepath where shaving 0.5 cycles (less even on
> > x86) matters (and the author is ok with taking a 500 cycles hit if he's
> > wrong)
>
> One more thing unlikely() does is to move the unlikely code out of line.
> So it should conserve some icache in critical functions, which might
> well be worth some more cycles (don't have numbers though).

I actually once measured context switching performance in the scheduler,
and removing the unlikely hint for testing RT tasks IIRC gave about a 5%
performance drop.

This was on a P4 which is very different from more modern CPUs both in
terms of branch performance characteristics, and icache characteristics.
However, the P4's branch predictor is pretty good, and it should easily
be able to correctly predict the rt_task check if it has enough entries.
So I think much of the savings came from code transformation and movement.
Anyway, it is definitely worthwhile if used correctly.

Actually one thing I don't like about gcc is that I think it still emits
cmovs for likely/unlikely branches, which is silly (the gcc developers
seem to be in love with that instruction). If that goes away, then
branch hints may be even better.
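
(For reference, the hints themselves are just thin wrappers around
__builtin_expect -- from include/linux/compiler.h:

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)

so everything here is really about how gcc weighs that hint.)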

>
> But overall I agree with you that unlikely is in most cases a bad
> idea (and I submitted the original patch introducing it originally). That
> is because it is often used in situations where gcc's default branch
> prediction heuristics would make exactly the same decision
>
>if (unlikely(x == NULL))
>
> is simply totally useless because gcc already assumes all x == NULL
> tests are unlikely. I appended some of the builtin heuristics from
> a recent gcc source so people can see them.
>
> Note in particular the last predictors; assuming branch ending
> with goto, including call, causing early function return or
> returning negative constant are not taken. Just these alone
> are likely 95+% of the unlikelies in the kernel.

Yes, gcc should be able to do pretty good heuristics, considering
the quite good numbers that cold CPU predictors can attain. However
for really performance critical code (or really "never" executed
code), then I think it is OK to have the hints and not have to rely
on gcc heuristics.

>
> -Andi

[snip]

Interesting, thanks!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queueing and complete affinity w/ threads: Some results

2008-02-18 Thread Nick Piggin
On Mon, Feb 18, 2008 at 02:33:17PM +0100, Andi Kleen wrote:
> Jens Axboe <[EMAIL PROTECTED]> writes:
> 
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
> 
> I actually liked Nick's queued smp_function_call_single() patch. So even
> if it was not used for block I would still like to see it being merged 
> in some form to speed up all the other IPI users.

Yeah, that hasn't been forgotten (nor have your comments about folding
my special function into smp_call_function_single).

The call function path is terribly unscalable at the moment on a lot
of architectures, and also it isn't allowed to be used with interrupts
off due to deadlock (which the queued version can allow, provided
that wait=0).
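
(The deadlock, concretely: CPU A and CPU B each send the other a
synchronous IPI with interrupts disabled; both then spin waiting for an
acknowledgement that can only come from the interrupt they have masked.
With the queued version and wait=0 there is nothing to spin on.)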

I will get around to sending that upstream soon.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LatencyTOP: sync_page description

2008-02-17 Thread Nick Piggin
On Saturday 16 February 2008 08:56, Török Edwin wrote:
> Hi Arjan,
>
> LatencyTOP says that sync_page is 'Writing a page to disk', however
> I see that even when no writes are involved, such as during a
> readdir, lseek, etc.
> Naming it a write is misleading, as no program is running that is
> doing constant writes to the disk. The only program is writing to a
> temp dir in /dev/shm.
>
> What would be a better description for sync_page?

Waiting on a page state change (usually: waiting for IO, but can be
also waiting for the page lock which is taken by some other part of
the kernel eg in page reclaim, truncate, buffered writes, page
faults).
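
(The path, roughly -- abbreviated from 2.6.24-era mm/filemap.c; every
waiter on PG_locked funnels through here, reads and writes alike, and
sync_page() itself just kicks/unplugs the block queue before sleeping:

	void __lock_page(struct page *page)
	{
		DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

		__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
							TASK_UNINTERRUPTIBLE);
	}

which is why it shows up under readdir/lseek traces too.)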

> Here are some /proc/latency_stats containing sync_page:
>
> 125 6937678 210821 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> get_unused_fd_flags path_walk do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area
> 306 5677749 215746 sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read vfs_llseek
> 21 435657 59966 sync_page sync_page __lock_page wake_bit_function
> read_cache_page_async ntfs_readpage read_cache_page map_mft_record
> ntfs_read_locked_inode ntfs_alloc_big_inode iget5_locked
> ntfs_test_inode
> 195 2716409 133660 blk_unplug sync_page sync_page __lock_page
> wake_bit_function read_cache_page_async ntfs_readpage
> read_cache_page map_mft_record ntfs_read_locked_inode
> ntfs_alloc_big_inode iget5_locked
> 28 1881278 181986 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read get_unused_fd_flags path_walk do_sync_read
> autoremove_wake_function security_file_permission
> 2 17132 9746 add_to_page_cache_lru sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 70 70 irq_exit sync_page sync_page_killable sync_page_killable
> __lock_page_killable wake_bit_function generic_file_aio_read
> do_sync_read autoremove_wake_function security_file_permission
> rw_verify_area vfs_read
> 23 306682 114514 blk_unplug sync_page sync_page_killable
> sync_page_killable __lock_page_killable wake_bit_function
> generic_file_aio_read do_sync_read autoremove_wake_function
> security_file_permission rw_verify_area vfs_read
> 1 153 153 hrtimer_interrupt smp_apic_timer_interrupt sync_page
> sync_page_killable sync_page_killable __lock_page_killable
> wake_bit_function generic_file_aio_read do_sync_read
> autoremove_wake_function cfq_idle_slice_timer security_file_permission

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/6] mmu_notifier: invalidate_page callbacks

2008-02-17 Thread Nick Piggin
On Saturday 16 February 2008 14:37, Andrew Morton wrote:
> On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <[EMAIL PROTECTED]> 
wrote:
> > Two callbacks to remove individual pages as done in rmap code
> >
> > invalidate_page()
> >
> > Called from the inner loop of rmap walks to invalidate pages.
> >
> > age_page()
> >
> > Called for the determination of the page referenced status.
> >
> > If we do not care about page referenced status then an age_page callback
> > may be be omitted. PageLock and pte lock are held when either of the
> > functions is called.
>
> The age_page mystery shallows.

BTW. can this callback be called mmu_notifier_clear_flush_young? To
match the core VM.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin
On Wednesday 13 February 2008 20:32, Andrew Morton wrote:
> On Wed, 13 Feb 2008 20:24:03 +1100 Nick Piggin <[EMAIL PROTECTED]> 
wrote:
> > BTW is it really true that the buffer can never be locked by
> > anything else at this point?
>
> It has been for the past five or six years.  With the page locked, nobody
> else can get at that page.

Hmm OK.


> > What about fsync_buffers_list?
>
> They're metadata buffers, not regular file data.  Things might get ugly if
> IO to /dev/sda went via that path, but it doesn't.

Yeah right... so the BUG_ON is basically because you want to avoid
the overhead of locking the buffer (which would presumably allow it
to work in situations where someone else might lock the buffer without
locking the page?). OK, makes sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Kernel BUG at fs/mpage.c:489

2008-02-13 Thread Nick Piggin
On Wednesday 13 February 2008 20:01, Andrew Morton wrote:
> On Wed, 13 Feb 2008 08:26:27 +0100 Bart Dopheide <[EMAIL PROTECTED]> wrote:
> > On Wed, Feb 13, 2008 at 12:05:45PM +1100, Nick Piggin wrote:
> > :)On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > :)> Almost certainly a hardware fail of some sort.
> > :)
> > :)Right, but the kernel shouldn't go bug...
> >
> > Indeed, that's why I'm reporting.
> >
> > :)I don't have a copy of your exact source code... which condition in
> > :)__mpage_writepage went BUG?
> >
> > BUG_ON(buffer_locked(bh));
> >
> > In a bit of context:
> > 482:if (page_has_buffers(page)) {
> > 483:struct buffer_head *head = page_buffers(page);
> > 484:struct buffer_head *bh = head;
> > 485:
> > 486:/* If they're all mapped and dirty, do it */
> > 487:page_block = 0;
> > 488:do {
> > 489:BUG_ON(buffer_locked(bh));
> > 490:if (!buffer_mapped(bh)) {
> > 491:/*
> > 492: * unmapped dirty buffers are created by
> > 493: * __set_page_dirty_buffers -> mmapped data
> > 494: */
> > 495:if (buffer_dirty(bh))
> > 496:goto confused;
> > 497:if (first_unmapped == blocks_per_page)
> > 498:first_unmapped = page_block;
> > 499:continue;
> > 500:}
>
> Probably means that either fat, IDE, block or fs/buffer.c failed to unlock
> a buffer_head when the IO error happened.  It's unlikely to be fat.

Yes that looks like it would be the problem. I can't really
see anything in buffer.c that would do it... 

BTW is it really true that the buffer can never be locked by
anything else at this point? What about fsync_buffers_list?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 17:06, Max Krasnyansky wrote:
> Nick Piggin wrote:

> > But don't let me dissuade you from making these good improvements
> > to Linux as well :) Just that it isn't really going to be hard-rt
> > in general.
>
> Actually that's the cool thing about CPU isolation. Get rid of all latency
> sources from the CPU(s) and you get yourself as hard-RT as it gets.

Hmm, maybe. Removing all sources of latency from the CPU kind of
implies that you have to audit the whole kernel for sources of
latency.

> I mean I _already_ have multi-core hard-RT systems that show ~1.2 usec
> worst case and ~200nsec average latency. I do not even need Adeos/Xenomai
> or Preempt-RT, just a few very small patches. And it can be used for non-RT
> stuff too.

OK, but you then are very restricted in what you can do, and easily
can break it especially if you run any userspace on that CPU. If
you just run a kernel module that, after setup, doesn't use any
other kernel resources except interrupt handling, then you might be
OK (depending on whether even interrupt handling can run into
contended locks)...

If you started doing very much more, then you can easily run into
trouble.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ALPHA] ES40 fails to boot with >=kernel 2.6.23

2008-02-12 Thread Nick Piggin
On Tuesday 12 February 2008 04:27, Raúl Porcel wrote:
> Hi,
>
> We have a Compaq AlphaServer ES40 and since 2.6.23 it won't boot. I'm
> attaching the console log and the kernel config.
>
> Need to say that with a DEC Xp1000 it works fine, although they're
> different machines, of course.
> With .22 it boots fine, and by booting fine i mean after we reverted to
> 2.6.22 it booted again and everything worked as expected.
> Still hangs with latest kernel.
>
> I'm attaching the verlinux output as well, hope it helps. If i'm missing
> something, please don't hesitate to ask.
>
> Thanks

Hi,

Thanks for reporting. I'm not an alpha person, but I have
cc'ed them in case they missed this.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 11:17, Nick Piggin wrote:
> On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:

> > It's a trivial dumb module which does nothing but loads and unloads.
> > I redid ftest03 later without any suspicious activity and it oopsed the
> > same way.
>
> Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
> that code... is there an easy way to get the ftest03 source and run
> it?

OK I didn't realise it is a test from ltp.

But I can't reproduce it for the life of me with the latest git kernel
and latest ltp tarball.

Is it easy to reproduce? Are you reproducing it simply by running the
ftest03 binary directly from the shell? How many times between oopses?
It is multi-process but no threads, so races should be minimal down
this path -- can you get an strace of the failing process?

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
> David Miller wrote:
> > From: Nick Piggin <[EMAIL PROTECTED]>
> > Date: Tue, 12 Feb 2008 17:41:21 +1100
> >
> >> stop machine is used for more than just module loading and unloading.
> >> I don't think you can just disable it.
> >
> > Right, in particular it is used for CPU hotplug.
>
> Ooops. Totally missed that. And a bunch of other places.
>
> [EMAIL PROTECTED] cpuisol-2.6.git]$ git grep -l stop_machine_run
> Documentation/cpu-hotplug.txt
> arch/s390/kernel/kprobes.c
> drivers/char/hw_random/intel-rng.c
> include/linux/stop_machine.h
> kernel/cpu.c
> kernel/module.c
> kernel/stop_machine.c
> mm/page_alloc.c
>
> I wonder why I did not see any issues when I disabled stop machine
> completely. I mentioned in the other thread that I commented out the part
> that actually halts the machine and ran it for several hours on my dual
> core laptop and on the quad core server. Tried all kinds of workloads,
> which include constant module removal and insertion, and cpu hotplug as
> well. It cannot be just luck :).

It really is. With subtle races, it can take a lot more than a few
hours. Consider that we have subtle races still in the kernel now,
which are almost never or rarely hit in maybe 10,000 hours * every
single person who has been using the current kernel for the past
year.

For a less theoretical example -- when I was writing the RCU radix
tree code, I tried to run directed stress tests on a 64 CPU Altix
machine (which found no bugs). Then I ran it on a dedicated test
harness that could actually do a lot more than the existing kernel
users are able to, and promptly found a couple more bugs (on a 2
CPU system).

But your primary defence against concurrency bugs _has_ to be
knowing the code and all its interactions.


> Clearly though, you guys are right. It cannot be simply disabled. Based on
> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
> and intel rng driver. Hopefully we can avoid it at least in module
> insertion/removal.

Yes, reducing the number of users by going through their code and
showing that it is safe, is the right way to do this. Also, you
could avoid module insertion/removal?

FWIW, I think the idea of trying to turn Linux into giving hard
realtime guarantees is just insane. If that is what you want, you
would IMO be much better off to spend effort with something like
improving adeos and communication/administration between Linux and
the hard-rt kernel.

But don't let me dissuade you from making these good improvements
to Linux as well :) Just that it isn't really going to be hard-rt
in general.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2 resend] mm: various cleanups in get_user_pages()

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 00:10, Eugene Teo wrote:
> Sorry for the repeated emails. Kindly ignore the previous resend. Please
> review this instead. Thanks. I have tested this.

If it is causing this many problems, can you split the cleanups into
their own patches?


> [PATCH 2/2] mm: various cleanups in get_user_pages()
>
> This patch contains various cleanups, including making sure vma is valid,
> and the return value of follow_hugetlb_page() is validated.
>
> Signed-off-by: Eugene Teo <[EMAIL PROTECTED]>
> ---
>  mm/memory.c |   24 
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 54f951b..c7e0610 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1003,7 +1003,9 @@ int get_user_pages(struct task_struct *tsk, struct
> mm_struct *mm, unsigned int foll_flags;
>
>   vma = find_extend_vma(mm, start);
> - if (!vma && in_gate_area(tsk, start)) {
> + if (!vma)
> + goto finish_or_fault;
> + if (in_gate_area(tsk, start)) {
>   unsigned long pg = start & PAGE_MASK;
>   struct vm_area_struct *gate_vma = get_gate_vma(tsk);
>   pgd_t *pgd;

Doesn't this break the logic?

If you don't have a vma, but you are in the gate area, then you
should use the gate vma. With your patch, the gate area will fault.

> @@ -1011,7 +1013,7 @@ int get_user_pages(struct task_struct *tsk, struct
> mm_struct *mm, pmd_t *pmd;
>   pte_t *pte;
>   if (write) /* user gate pages are read-only */
> - return i ? : -EFAULT;
> + goto finish_or_fault;

I don't know if this is exactly a cleanup or not... I guess gcc
probably isn't smart enough to fold them all together, so it should
use a little less code in the unlikely branches. Does it?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Kernel BUG at fs/mpage.c:489

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 08:50, Alan Cox wrote:
> > Feb 12 19:55:08 butterfly kernel: hde: dma timeout error: status=0xd0 {
> > Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode was: unknown
>
> Your drive stopped responding.
>
> > Feb 12 19:55:08 butterfly kernel: hde: DMA disabled
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Primary channel reset.
> > Feb 12 19:55:08 butterfly kernel: PDC202XX: Secondary channel reset.
>
> We gave it a good kicking and it stayed offline
>
> > Feb 12 19:55:08 butterfly kernel: hde: set_drive_speed_status:
> > status=0xd0 { Busy } Feb 12 19:55:08 butterfly kernel: ide: failed opcode
> > was: unknown Feb 12 19:55:47 butterfly kernel: ide2: reset timed-out,
> > status=0xd0 Feb 12 19:55:47 butterfly kernel: hde: status timeout:
> > status=0xd0 { Busy }
>
> And we gave up.
>
> Almost certainly a hardware fail of some sort.

Right, but the kernel shouldn't go bug...

I don't have a copy of your exact source code... which condition in
__mpage_writepage went BUG?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-12 Thread Nick Piggin
On Wednesday 13 February 2008 09:27, Alexey Dobriyan wrote:
> On Tue, Feb 12, 2008 at 02:04:30PM -0800, Andrew Morton wrote:
> > On Sun, 10 Feb 2008 17:00:31 +0300
> >
> > Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> > > This happened during LTP. FWIW, modprobe/rmmod trivial empty module
> > > together with cat /proc/*/wchan and cat /proc/modules were also
> > > running.
> > >
> > > Box is E6400, much debugging is on, config below.
> > >
> > >
> > > [ 4057.31] BUG: unable to handle kernel paging request at
> > > 810101dbc008 [ 4057.31] IP: []
> > > iov_iter_advance+0x38/0x70 [ 4057.31] PGD 8063 PUD c063 PMD
> > > 153baa163 PTE 800101dbc160 [ 4057.31] Oops:  [1] SMP
> > > DEBUG_PAGEALLOC
> > > [ 4057.31] CPU 0
> > > [ 4057.31] Modules linked in: [last unloaded: foo]
> >
> > what is this foo.ko of which you speak, and did it wreck your kernel?
>
> It's a trivial dumb module which does nothing but loads and unloads.
> I redid ftest03 later without any suspicious activity and it oopsed the
> same way.

Ah crap. Hmm, maybe I didn't consider all cases with my last patch to
that code... is there an easy way to get the ftest03 source and run
it?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull for -mm] CPU isolation extensions (updated2)

2008-02-11 Thread Nick Piggin
On Tuesday 12 February 2008 15:10, Max Krasnyansky wrote:

> Rusty - Stop machine.
>After doing a bunch of testing last three days I actually downgraded
> stop machine changes from [highly experimental] to simply [experimental].
> Pleas see this thread for more info:
> http://marc.info/?l=linux-kernel&m=120243837206248&w=2 Short story is that
> I ran several insmod/rmmod workloads on live multi-core boxes with stop
> machine _completely_ disabled and did no see any issues. Rusty did not get
> a chance to reply yet, I hopping that we'll be able to make "stop machine"
> completely optional for some configurations.

stop machine is used for more than just module loading and unloading.
I don't think you can just disable it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin
On Tuesday 12 February 2008 14:16, Robert Hancock wrote:
> Nick Piggin wrote:
> > On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> >> Avoid buffer overflows in get_user_pages()
> >>
> >> So I spent a while pounding my head against my monitor trying to figure
> >> out the vmsplice() vulnerability - how could a failure to check for
> >> *read* access turn into a root exploit?  It turns out that it's a buffer
> >> overflow problem which is made easy by the way get_user_pages() is
> >> coded.
> >>
> >> In particular, "len" is a signed int, and it is only checked at the
> >> *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
> >> will execute once and decrement len to -1.  At that point, the loop will
> >> proceed until the next invalid address is found; in the process, it will
> >> likely overflow the pages array passed in to get_user_pages().
> >>
> >> I think that, if get_user_pages() has been asked to grab zero pages,
> >> that's what it should do.  Thus this patch; it is, among other things,
> >> enough to block the (already fixed) root exploit and any others which
> >> might be lurking in similar code.  I also think that the number of pages
> >> should be unsigned, but changing the prototype of this function probably
> >> requires some more careful review.
> >>
> >> Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e5628a5..7f50fd8 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct
> >> mm_struct *mm, int i;
> >>unsigned int vm_flags;
> >>
> >> +  if (len <= 0)
> >> +  return 0;
> >
> > BUG_ON()?
>
> Well, not if the code involved in the exploit can pass a zero value,

Which is a bug, and you want to catch it.


> otherwise it's just turning it into a DoS..

If it is due to a security bug, then the fix is to fix the point
where the kernel starts trusting an untrusted value. Not to hide
the bug like this. Arguably, a BUG_ON is better in the case of a
security hole because you want to halt the process as soon as you
detect a problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoid buffer overflows in get_user_pages()

2008-02-11 Thread Nick Piggin
On Tuesday 12 February 2008 10:17, Jonathan Corbet wrote:
> Avoid buffer overflows in get_user_pages()
>
> So I spent a while pounding my head against my monitor trying to figure
> out the vmsplice() vulnerability - how could a failure to check for
> *read* access turn into a root exploit?  It turns out that it's a buffer
> overflow problem which is made easy by the way get_user_pages() is
> coded.
>
> In particular, "len" is a signed int, and it is only checked at the
> *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
> will execute once and decrement len to -1.  At that point, the loop will
> proceed until the next invalid address is found; in the process, it will
> likely overflow the pages array passed in to get_user_pages().
>
> I think that, if get_user_pages() has been asked to grab zero pages,
> that's what it should do.  Thus this patch; it is, among other things,
> enough to block the (already fixed) root exploit and any others which
> might be lurking in similar code.  I also think that the number of pages
> should be unsigned, but changing the prototype of this function probably
> requires some more careful review.
>
> Signed-off-by: Jonathan Corbet <[EMAIL PROTECTED]>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e5628a5..7f50fd8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -989,6 +989,8 @@ int get_user_pages(struct task_struct *tsk, struct
> mm_struct *mm, int i;
>   unsigned int vm_flags;
>
> + if (len <= 0)
> + return 0;

BUG_ON()?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Oops report for the week upto Feb 10th 2008

2008-02-10 Thread Nick Piggin
On Monday 11 February 2008 11:35, Arjan van de Ven wrote:
> The http://www.kerneloops.org website collects kernel oops and
> warning reports from various mailing lists and bugzillas as well as
> with a client users can install to auto-submit oopses.
> Below is a top 10 list of the oopses/backtraces collected in the last 7
> days. (Reports prior to 2.6.23 have been omitted in collecting the top 10)
>
> This week, a total of 323 oopses and warnings have been reported,
> compared to 110 reports in the previous week.
>
> (This sharp increase is due to Fedora 9 alpha shipping the oops data
> collection client in the default install, giving us much wider coverage
> in the issues that actual users hit; many thanks to the Fedora project
> for this)
>
> With the 2.6.25-rc1 release out, this will be the last report that includes
> 2.6.23; future reports will only include issues from 2.6.24 and later.
>
>
> Rank 1: set_dentry_child_flags
>   WARN_ON at fs/inotify.c:172 set_dentry_child_flags
>   Reported 93 times (116 total reports)
>   This is a user triggered WARN_ON in inotify. Sadly inotify seems to be
> unmaintained. More info:
> http://www.kerneloops.org/search.php?search=set_dentry_child_flags

I was never able to trigger this or get anyone to reliably trigger it with
a debug patch in. Which is why it has taken so long to fix. It looks like
kde4 is triggering this big rash of new reports.

Anyway, I have fixed a race or two and removed that warning code (which was
also a little racy). So I think that should be OK.


> Rank 9: mark_buffer_dirty
>   WARN_ON at fs/buffer.c:1169
>   This indicates that a non-uptodate buffer is marked dirty.
>   This can lead to data corruption!
>   Reported 5 times (12 total reports) - Only seen since 2.6.24-rc6
>   Usually happens during umount()
>   More info: http://www.kerneloops.org/search.php?search=mark_buffer_dirty

That's interesting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] block layer: kmemcheck fixes

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 02:56:09PM -0800, Arjan van de Ven wrote:
> Nick Piggin wrote:
> >>>Maybe cpus these days have so much store bandwith that doing
> >>>things like the above is OK, but I doubt it :-)
> >>on modern x86 cpus the memset may even be faster if the memory isn't in 
> >>cache;
> >>the "explicit" method ends up doing Write Allocate on the cache lines
> >>(so read them from memory) even though they then end up being written 
> >>entirely.
> >>With memset the CPU is told that the entire range is set to a new value, 
> >>and
> >>the WA can be avoided for the whole-cachelines in the range.
> >
> >Don't you have write combining store buffers? Or is it still speculatively
> >issuing the reads even before the whole cacheline is combined?
> 
> the x86 memory ordering model doesn't quite allow that; and you'd need a
> "series" of at least 64 bytes without any other memory accesses in
> between, even if that would not happen in practice.

OK, fair enough... then it will be a very nice test to see if it
helps. I'm sure you could have an arch specific initialisation
function if it makes a significant difference.
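
(Something as dumb as this would do for the experiment -- entirely
hypothetical names, not an existing kernel interface:

	/* arches where memset avoids write-allocate can override this */
	#ifndef arch_init_request
	#define arch_init_request(rq)	memset((rq), 0, sizeof(*(rq)))
	#endif

and an arch that loses on the bulk clear can keep field-by-field init.)
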
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] block layer: kmemcheck fixes

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 07:09:07AM -0800, Arjan van de Ven wrote:
> David Miller wrote:
> >From: Linus Torvalds <[EMAIL PROTECTED]>
> >Date: Thu, 7 Feb 2008 09:42:56 -0800 (PST)
> >
> >>Can we please just stop doing these one-by-one assignments, and just do 
> >>something like
> >>
> >>memset(rq, 0, sizeof(*rq));
> >>rq->q = q;
> >>rq->ref_count = 1;
> >>INIT_HLIST_NODE(&rq->hash);
> >>RB_CLEAR_NODE(&rq->rb_node);
> >>
> >>instead?
> >>
> >>The memset() is likely faster and smaller than one-by-one assignments 
> >>anyway, even if the one-by-ones can avoid initializing some field or 
> >>there ends up being a double initialization..
> >
> >The problem is store buffer compression.  At least a few years
> >ago this made a huge difference in sk_buff initialization in the
> >networking.
> >
> >Maybe cpus these days have so much store bandwith that doing
> >things like the above is OK, but I doubt it :-)
> 
> on modern x86 cpus the memset may even be faster if the memory isn't in 
> cache;
> the "explicit" method ends up doing Write Allocate on the cache lines
> (so read them from memory) even though they then end up being written 
> entirely.
> With memset the CPU is told that the entire range is set to a new value, and
> the WA can be avoided for the whole-cachelines in the range.

Don't you have write combining store buffers? Or is it still speculatively
issuing the reads even before the whole cacheline is combined?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 09:24:22AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> > > On Fri, Feb 08 2008, Nick Piggin wrote:
> > > > And if you don't?
> > > 
> > > Well if you don't ask for anything, you wont get anything :-)
> > > As I mentioned, the patch is a playing ground for trying various setups.
> > > Everything defaults to 'do as usual', set options to setup certain test
> > > scenarios.
> > 
> > I mean if you don't know the completing CPU.
> 
> I still don't know quite what part of that patch you are referring to
> here. If you don't have queue_affinity set, queueing a new request with
> the hardware is generally done on the same CPU that just completed a
> request. That is true even without any patches.

Generally, but I guess not always. The database workloads in question
(which you might know very well about ;)) apparently have a lot of
queue-empty and unplug conditions. Which I guess is the reason for
Intel's initial patch.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-08 Thread Nick Piggin
On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > And if you don't?
> 
> Well if you don't ask for anything, you wont get anything :-)
> As I mentioned, the patch is a playing ground for trying various setups.
> Everything defaults to 'do as usual', set options to setup certain test
> scenarios.

I mean if you don't know the completing CPU.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] more SLUB updates for 2.6.25

2008-02-08 Thread Nick Piggin
On Friday 08 February 2008 18:29, Eric Dumazet wrote:
> Nick Piggin a écrit :
> > On Friday 08 February 2008 13:13, Christoph Lameter wrote:
> >> are available in the git repository at:
> >>
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git
> >> slub-linus
> >>
> >> (includes the cmpxchg_local fastpath since the cmpxchg_local work
> >> by Matheiu is in now, and the non atomic unlock by Nick. Verified that
> >> this is not doing any harm after some other patches had been removed.
> >
> > Ah, good. I think it is always a good thing to be able to remove atomics.
> > They place quite a bit of burden on the CPU, especially x86 where it also
> > has implicit memory ordering semantics (although x86 can speculatively
> > get around much of the problem, it's obviously worse than no restriction)
> >
> > Even if perhaps some cache coherency or timing quirk makes the non-atomic
> > version slower (all else being equal), then I'd still say that the non
> > atomic version should be preferred.
>
> What about IRQ masking then ?

I really did mean all else being equal. eg. "clear_bit" vs "__clear_bit".


> Many CPU pay high cost for cli/sti pair...

True, and many UP architectures have to implement atomic operations
with cli/sti pairs... so those are more reasons to use non-atomics.
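
(The pattern on such architectures is essentially this -- a sketch in
the style of the pre-ARMv6 fallback:

	static inline void atomic_add(int i, atomic_t *v)
	{
		unsigned long flags;

		local_irq_save(flags);
		v->counter += i;
		local_irq_restore(flags);
	}

so every "atomic" op pays for an irq save/restore anyway.)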


> And SLAB/SLUB allocators, even if only used from process context, want to
> disable/re-enable interrupts...
>
> I understand kmalloc() wants generic pools, but dedicated pools could avoid
> this cli/sti

Sure, I guess that would be possible. I've kind of toyed with doing
some cli/sti mitigation in the page allocator, but in that case I
found that it wasn't a win outside microbenchmarks: the cache
characteristics of the returned pages are just as important if not
more so than cli/sti costs (although that balance would change
depending on the CPU and workload I guess).

For slub yes you could do it with fewer downsides with process context
pools.

Is it possible instead for architectures where cli/sti is so expensive
to change their lowest level of irq handling to do this by setting and
clearing a soft flag somewhere? That's what I'd rather see, if possible.
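
(By a soft flag I mean something along these lines -- a sketch only,
with hypothetical names; powerpc's lazy interrupt disabling is the
closest existing example:

	static DEFINE_PER_CPU(int, soft_irqs_disabled);
	static DEFINE_PER_CPU(int, irq_pending);

	static inline void soft_local_irq_disable(void)
	{
		__get_cpu_var(soft_irqs_disabled) = 1;
		barrier();
	}

	static inline void soft_local_irq_enable(void)
	{
		barrier();
		__get_cpu_var(soft_irqs_disabled) = 0;
		if (unlikely(__get_cpu_var(irq_pending)))
			replay_pending_irqs();	/* hypothetical */
	}

The low-level entry code checks the flag and just marks the interrupt
pending instead of delivering it, so the common fastpath never touches
the real interrupt mask.)
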
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-07 Thread Nick Piggin
On Fri, Feb 08, 2008 at 08:47:47AM +0100, Jens Axboe wrote:
> On Fri, Feb 08 2008, Nick Piggin wrote:
> > On Thu, Feb 07, 2008 at 07:25:45PM +0100, Jens Axboe wrote:
> > > Hi,
> > > 
> > > Here's a variant using kernel threads only, the nasty arch bits are then
> > > not needed. Works for me, no performance testing (that's a hint for Alan
> > > to try and queue up some testing for this variant as well :-)
> > 
> > Well this stuff looks pretty nice (although I'm not sure whether the
> > softirq->thread changes are a good idea for performance, I guess we'll
> > see).
> 
> Yeah, that is indeed an open question and why I have two separate
> patches for now (io-cpu-affinity branch and io-cpu-affinity-kthread
> branch). As Ingo mentioned, this is how softirqs are handled in the -rt
> branch already.
 
True, although there are some IO workloads where -rt falls behind
mainline. May not be purely due to irq threads though, of course.


> > You still don't have the option that the Intel patch gave, that is,
> > to submit on the completer. I guess that you could do it somewhat
> > generically by having a cpuid in the request queue, and update that
> > with the completing cpu.
> 
> Not sure what you mean, if setting queue_affinity doesn't accomplish it.
> If you know the completing CPU to begin with, surely you can just set
> the queuing affinity appropriately?

And if you don't?


> > At least they reported it to be the most efficient scheme in their
> > testing, and Dave thought that migrating completions out to submitters
> > might be a bottleneck in some cases.
> 
> More so than migrating submitters to completers? The advantage of only
> moving submitters is that you get rid of the completion locking. Apart
> from that, the cost should be the same, especially for the thread based
> solution.

Not specifically for the block layer, but higher layers like xfs.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-07 Thread Nick Piggin
On Tue, Feb 05, 2008 at 11:14:19AM +1100, David Chinner wrote:
> On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> > You get better behaviour in the slab and page allocators and locality
> > and cache hotness of memory. For example, I guess in a filesystem /
> > pagecache heavy workload, you have to touch each struct page, buffer head,
> > fs private state, and also often have to wake the thread for completion.
> > Much of this data has just been touched at submit time, so doing this on
> > the same CPU is nice...
> 
> [...]
> 
> > I'm surprised that the xfs global state bouncing would outweigh the
> > bouncing of all the per-page/block/bio/request/etc data that gets touched
> > during completion. We'll see.
> 
> per-page/block/bio/request/etc is local to a single I/O. The only
> penalty is a cacheline bounce for each of the structures from one
> CPU to another.  That is, there is no global state modified by these
> completions.

Yeah, but it is going from _all_ submitting CPUs to the one completing
CPU. So you could bottleneck the interconnect at the completing CPU
just as much as if you had cachelines being pulled the other way (ie.
many CPUs trying to pull in a global cacheline).

 
> The real issue is metadata. The transaction log I/O completion
> funnels through a state machine protected by a single lock, which
> means completions on different CPUs pulls that lock to all
> completion CPUs. Given that the same lock is used during transaction
> completion for other state transitions (in task context, not intr),
> the more cpus active at once touches, the worse the problem gets.

OK, once you add locking (and not simply cacheline contention), then
the problem gets harder I agree. But I think that if the submitting
side takes the same locks as log completion (eg. maybe for starting a
new transaction), then it is not going to be a clear win either way,
and you'd have to measure it in the end.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-07 Thread Nick Piggin
On Thu, Feb 07, 2008 at 07:25:45PM +0100, Jens Axboe wrote:
> Hi,
> 
> Here's a variant using kernel threads only, the nasty arch bits are then
> not needed. Works for me, no performance testing (that's a hint for Alan
> to try and queue up some testing for this variant as well :-)

Well this stuff looks pretty nice (although I'm not sure whether the
softirq->thread changes are a good idea for performance, I guess we'll
see).

You still don't have the option that the Intel patch gave, that is,
to submit on the completer. I guess that you could do it somewhat
generically by having a cpuid in the request queue, and update that
with the completing cpu.
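
(Something like this sketch is what I have in mind; the field and the
helpers are made up for illustration:)

/* Record the last completing CPU in the queue so the submit side can
 * steer itself accordingly. Field and helper names are invented. */
struct request_queue {
        /* ... existing fields ... */
        int             completion_cpu; /* last completing CPU, or -1 */
};

static inline void blk_note_completion_cpu(struct request_queue *q)
{
        q->completion_cpu = smp_processor_id(); /* from the completion path */
}

static inline int blk_submit_cpu_hint(struct request_queue *q)
{
        int cpu = q->completion_cpu;

        return cpu >= 0 ? cpu : smp_processor_id();
}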

At least they reported it to be the most efficient scheme in their
testing, and Dave thought that migrating completions out to submitters
might be a bottleneck in some cases.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] more SLUB updates for 2.6.25

2008-02-07 Thread Nick Piggin
On Friday 08 February 2008 13:13, Christoph Lameter wrote:
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git slub-linus
>
> (includes the cmpxchg_local fastpath since the cmpxchg_local work
> by Mathieu is in now, and the non-atomic unlock by Nick. Verified that
> this is not doing any harm after some other patches had been removed.)

Ah, good. I think it is always a good thing to be able to remove atomics.
They place quite a bit of burden on the CPU, especially x86 where it also
has implicit memory ordering semantics (although x86 can speculatively
get around much of the problem, it's obviously worse than no restriction)

Even if perhaps some cache coherency or timing quirk makes the non-atomic
version slower (all else being equal), then I'd still say that the
non-atomic version should be preferred.

Thanks,
Nick

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] SLUB updates for 2.6.25

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 11:32, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > Ok. But the approach is just not so good. If you _really_ need something
> > like that and it is a win over the regular non-atomic unlock, then you
> > just have to implement it as a generic locking / atomic operation and
> > allow all architectures to implement the optimal (and correct) memory
> > barriers.
>
> Assuming this really gives a benefit on several benchmarks then we need
> to think about how to do this some more. Its a rather strange form of
> locking.
>
> Basically you lock the page with a single atomic operation that sets
> PageLocked and retrieves the page flags.

This operation is not totally unusual. I could use it for my optimised
page lock patches for example (although I need an operation that clears
a flag and has release semantics, but similar class of "thing").


> Then we shovel the page state 
> around a couple of functions in a register and finally store the page
> state back which at the same time unlocks the page.

And this is a store-for-unlock (eg. with release semantics).
Nothing too special about that either I guess. (it is almost the word
equivalent of clear_bit_unlock).
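
(The pair being discussed, sketched with made-up names. A real version
would be a per-arch primitive; here a cmpxchg loop stands in for the
single atomic op:)

/* Acquire: one atomic op that sets PG_locked and returns the old
 * flags word for the caller to carry around. */
static inline unsigned long lock_page_fetch_flags(struct page *page)
{
        unsigned long old, new;

        do {
                old = page->flags;
                new = old | (1UL << PG_locked);
        } while (cmpxchg(&page->flags, old, new) != old);

        return old;
}

/* Release: plain store of the (possibly modified) flags word, which
 * also clears PG_locked and thereby unlocks the page. */
static inline void unlock_page_put_flags(struct page *page, unsigned long flags)
{
        smp_mb();       /* order prior accesses before the unlocking store */
        page->flags = flags & ~(1UL << PG_locked);
}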


> So two memory 
> references with one of them being atomic with none in between. We have
> nothing that can do something like that right now.

The load you are trying to avoid in the lock really isn't that
expensive. The cacheline is in L1. Even after a store, many CPUs
have store forwarding so it is probably not going to matter at all
on those.

Anyway, not saying the operations are useless, but they should be
made available to core kernel and implemented per-arch. (if they are
found to be useful)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] SLUB updates for 2.6.25

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 10:47, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > > erk, sorry, I misremembered.   I was about to merge all the patches we
> > > weren't going to merge.  oops.
> >
> > While you're there, can you drop the patch(es?) I commented on
> > and didn't get an answer to. Like the ones that open code their
> > own locking primitives and do risky looking things with barriers
> > to boot...
>
> That patch will be moved to a special archive for
> microbenchmarks. It shows the same issues like the __unlock patch.

Ok. But the approach is just not so good. If you _really_ need something
like that and it is a win over the regular non-atomic unlock, then you
just have to implement it as a generic locking / atomic operation and
allow all architectures to implement the optimal (and correct) memory
barriers.

Anyway


> > Also, WRT this one:
> > slub-use-non-atomic-bit-unlock.patch
> >
> > This is strange that it is unwanted. Avoiding atomic operations
> > is a pretty good idea. The fact that it appears to be slower on
> > some microbenchmark on some architecture IMO either means that
> > their __clear_bit_unlock or the CPU isn't implemented so well...
>
> Its slower on x86_64 and that is a pretty important arch. So
> I am to defer this until we have analyzed the situation some more. Could
> there be some effect of atomic ops on the speed with which a cacheline is
> released?

I'm sure it could have an effect. But why is the common case in SLUB
for the cacheline to be bouncing? What's the benchmark? What does SLAB
do in that benchmark, is it faster than SLUB there? What does the
non-atomic bit unlock do to Willy's database workload?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git pull] SLUB updates for 2.6.25

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 09:30, Andrew Morton wrote:
> On Mon, 4 Feb 2008 14:28:45 -0800
>
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > root (1):
> > >   SLUB: Do not upset lockdep
> >
> > err, what?  I though I was going to merge these:
> >
> > slub-move-count_partial.patch
> > slub-rename-numa-defrag_ratio-to-remote_node_defrag_ratio.patch
> > slub-consolidate-add_partial-and-add_partial_tail-to-one-function.patch
> > slub-use-non-atomic-bit-unlock.patch
> > slub-fix-coding-style-violations.patch
> > slub-noinline-some-functions-to-avoid-them-being-folded-into-alloc-free.p
> >atch
> > slub-move-kmem_cache_node-determination-into-add_full-and-add_partial.pat
> >ch
> > slub-avoid-checking-for-a-valid-object-before-zeroing-on-the-fast-path.pa
> >tch slub-__slab_alloc-exit-path-consolidation.patch
> > slub-provide-unique-end-marker-for-each-slab.patch
> > slub-avoid-referencing-kmem_cache-structure-in-__slab_alloc.patch
> > slub-optional-fast-path-using-cmpxchg_local.patch
> > slub-do-our-own-locking-via-slab_lock-and-slab_unlock.patch
> > slub-restructure-slab-alloc.patch
> > slub-comment-kmem_cache_cpu-structure.patch
> > slub-fix-sysfs-refcounting.patch
> >
> > before you went and changed things under my feet.
>
> erk, sorry, I misremembered.   I was about to merge all the patches we
> weren't going to merge.  oops.

While you're there, can you drop the patch(es?) I commented on
and didn't get an answer to. Like the ones that open code their
own locking primitives and do risky looking things with barriers
to boot...

Also, WRT this one:
slub-use-non-atomic-bit-unlock.patch

This is strange that it is unwanted. Avoiding atomic operations
is a pretty good idea. The fact that it appears to be slower on
some microbenchmark on some architecture IMO either means that
their __clear_bit_unlock or the CPU isn't implemented so well...
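
(For reference, the difference in question, roughly as the generic
fallbacks spell it out:)

/* Atomic unlock: barrier plus a locked RMW (lock-prefixed on x86). */
#define clear_bit_unlock_sketch(nr, addr)       \
do {                                            \
        smp_mb__before_clear_bit();             \
        clear_bit(nr, addr);                    \
} while (0)

/* Non-atomic unlock: full barrier plus a plain load/modify/store;
 * safe because the lock holder owns the word until the store. */
#define __clear_bit_unlock_sketch(nr, addr)     \
do {                                            \
        smp_mb();                               \
        __clear_bit(nr, addr);                  \
} while (0)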
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-02-04 Thread Nick Piggin
On Tuesday 05 February 2008 01:49, Mike Galbraith wrote:
> On Tue, 2008-01-22 at 06:47 +0100, Mike Galbraith wrote:
> > On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:
> > > > I've hit same twice recently (not pan, and not repeatable).
> > >
> > > Nasty. The attached patch is something really simple that can sometimes
> > > help. sysrq+p is also an option, if you're on a UP system.
> >
> > SMP (P4/HT imitating real cores)
> >
> > > Any luck getting traces?
> >
> > We'll see.  Armed.
>
> Hm.  ld just went loopy (but killable) in v2.6.24-6928-g9135f19.  During
> kbuild, modpost segfaulted, restart build, ld goes gaga.  Third attempt,
> build finished.  Not what I hit before, but mentionable.
>
>
> [  674.589134] modpost[18588]: segfault at 3e8dc42c ip 0804a96d sp af982920
> error 5 in modpost[8048000+9000] [  674.589211] mm/memory.c:115: bad pgd
> 3e081163.
> [  674.589214] mm/memory.c:115: bad pgd 3e0d2163.
> [  674.589217] mm/memory.c:115: bad pgd 3eb01163.

Hmm, this _could_ be bad memory. Or if it is very easy to reproduce with
a particular kernel version, then it is probably a memory scribble from
another part of the kernel :(

The first thing, I guess, would be to run memtest86 for a while if you
have time; it's easy and often helpful.

If that's clean, then I don't have another good option except to bisect
the problem. Turning on DEBUG_VM, DEBUG_SLAB, DEBUG_LIST, DEBUG_PAGEALLOC,
DEBUG_STACKOVERFLOW, DEBUG_RODATA might help catch it sooner... SLAB and
PAGEALLOC could slow you down quite a bit though. And if the problem is
quite reproducible, then obviously don't touch your config ;)
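
For the debug options above, that's roughly this in your .config
(option names from memory; double-check them against your tree):

CONFIG_DEBUG_VM=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_RODATA=y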

Thanks,
Nick


>
> [ 1407.322144]  ===
> [ 1407.322144] ldR running  0 21963  21962
> [ 1407.322144]db9d7f1c 00200086 c75f9020 b1814300 b0428300 b0428300
> b0428300 c75f9280 [ 1407.322144]b1814300 0001 db9d7000 
> d08c2f90 dba4f300 0002  [ 1407.322144]b1810120 dba4f334
> 00200046  db9d7000 c75f9020 db9d7f30 b02f333f [ 1407.322144] Call
> Trace:
> [ 1407.322144]  [] preempt_schedule_irq+0x45/0x5b
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] need_resched+0x1f/0x21
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] ? do_page_fault+0x4c/0x470
> [ 1407.322144]  [] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [] ? error_code+0x72/0x78
> [ 1407.322144]  [] ? init_transmeta+0xcf/0x22f <== zzt P4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Monthly md check == hung machine; how do I debug?

2008-02-04 Thread Nick Piggin
On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> I've got a machine with a 4 disk SATA raid10 configuration using md.
> The entire disk is loop-AES encrypted, but that shouldn't matter
> here.
>
> Once a month, Debian runs:
>
> /usr/share/mdadm/checkarray --cron --all --quiet
>
> and the machine hangs within 30 minutes of that starting.
>
> It seems that I can avoid the hang by not having "mdadm --monitor"
> running, but I'm not certain if that's the case or if I've just been
> lucky this go-round.
>
> I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> Athlon(tm) 64 Processor 3700+.
>
> I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> can't find anything that looks relevant.
>
> So, how can I (help you all) debug this?

Do you have a serial console? Does it respond to pings?

Can you try to get sysrq+T traces, and sysrq+P traces, and post
them?
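
(If the keyboard combination is awkward, the same traces can be
triggered from a shell, assuming CONFIG_MAGIC_SYSRQ is enabled:)

echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
echo t > /proc/sysrq-trigger         # task backtraces, lands in dmesg
echo p > /proc/sysrq-trigger         # CPU/register dump
dmesg > sysrq-output.txt             # capture it for posting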
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-04 Thread Nick Piggin
On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
> On Sun, Feb 03 2008, Nick Piggin wrote:
> > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > 
> > Hi guys,
> > 
> > Just had another way we might do this. Migrate the completions out to
> > the submitting CPUs rather than migrate submission into the completing
> > CPU.
> > 
> > I've got a basic patch that passes some stress testing. It seems fairly
> > simple to do at the block layer, and the bulk of the patch involves
> > introducing a scalable smp_call_function for it.
> > 
> > Now it could be optimised more by looking at batching up IPIs or
> > optimising the call function path or even migrating the completion event
> > at a different level...
> > 
> > However, this is a first cut. It actually seems like it might be taking
> > slightly more CPU to process block IO (~0.2%)... however, this is on my
> > dual core system that shares an llc, which means that there are very few
> > cache benefits to the migration, but non-zero overhead. So on multisocket
> > systems hopefully it might get to positive territory.
> 
> That's pretty funny, I did pretty much the exact same thing last week!

Oh nice ;)


> The primary difference between yours and mine is that I used a more
> private interface to signal a softirq raise on another CPU, instead of
> allocating call data and exposing a generic interface. That put the
> locking in blk-core instead, turning blk_cpu_done into a structure with
> a lock and list_head instead of just being a list head, and intercepted
> at blk_complete_request() time instead of waiting for an already raised
> softirq on that CPU.

Yeah I was looking at that... didn't really want to add the spinlock
overhead to the non-migration case. Anyway, I guess that sort of
fine implementation detail is going to have to be sorted out with
results.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-04 Thread Nick Piggin
On Mon, Feb 04, 2008 at 03:40:20PM +1100, David Chinner wrote:
> On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> > David Chinner wrote:
> > >Hi Nick,
> > >
> > >When Matthew was describing this work at an LCA presentation (not
> > >sure whether you were at that presentation or not), Zach came up
> > >with the idea that allowing the submitting application control the
> > >CPU that the io completion processing was occurring would be a good
> > >approach to try.  That is, we submit a "completion cookie" with the
> > >bio that indicates where we want completion to run, rather than
> > >dictating that completion runs on the submission CPU.
> > >
> > >The reasoning is that only the higher level context really knows
> > >what is optimal, and that changes from application to application.
> > 
> > well.. kinda. One of the really hard parts of the submit/completion stuff 
> > is that
> > the slab/slob/slub/slib allocator ends up basically "cycling" memory 
> > through the system;
> > there's a sink of free memory on all the submission cpus and a source of 
> > free memory
> > on the completion cpu. I don't think applications are capable of working 
> > out what is
> > best in this scenario..
> 
> Applications as in "anything that calls submit_bio()". i.e, direct I/O,
> filesystems, etc. i.e. not userspace but in-kernel applications.
> 
> In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
> contention of global structures in XFS. By controlling where completions are
> delivered, we can greatly reduce this contention, especially on large,
> multipathed devices that deliver interrupts to multiple CPUs that may be far
> distant from each other.  We have all the state and intelligence necessary
> to control this sort of policy decision effectively.

Hi Dave,

Thanks for taking a look at the patch... yes it would be easy to turn
this bit of state into a more flexible cookie (eg. complete on submitter;
complete on interrupt; complete on CPUx/nodex etc.). Maybe we'll need
something that complex... I'm not sure, it would probably need more
fine tuning. That said, I just wanted to get this approach out there
early for rfc.

I guess both you and Arjan have points. For a _lot_ of things, completing
on the same CPU as the submitter (whether that is migrating submission as in
the original patch in the thread, or migrating completion like I do) is
going to be a win.

You get better behaviour in the slab and page allocators and locality
and cache hotness of memory. For example, I guess in a filesystem /
pagecache heavy workload, you have to touch each struct page, buffer head,
fs private state, and also often have to wake the thread for completion.
Much of this data has just been touched at submit time, so doing this on
the same CPU is nice...

I'm surprised that the xfs global state bouncing would outweigh the
bouncing of all the per-page/block/bio/request/etc data that gets touched
during completion. We'll see.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-03 Thread Nick Piggin
On Sun, Feb 03, 2008 at 12:53:02PM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Feb 3, 2008 11:52 AM, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > +asmlinkage void smp_call_function_fast_interrupt(void)
> > +{
> 
> [snip]
> 
> > +   while (!list_empty(&list)) {
> > +   struct call_single_data *data;
> > +
> > +   data = list_entry(list.next, struct call_single_data, list);
> > +   list_del(&data->list);
> > +
> > +   data->func(data->info);
> > +   if (data->wait) {
> > +   smp_mb();
> > +   data->wait = 0;
> 
> Why do we need smp_mb() here (maybe add a comment to keep
> Andrew/checkpatch happy)?

Yeah, definitely... it's just a really basic RFC, but I should get
into the habit of just doing it anyway.
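
Something like this, say:

        data->func(data->info);
        if (data->wait) {
                /*
                 * Make sure func()'s stores are visible to the waiting
                 * CPU before it sees wait == 0 and frees the call data.
                 */
                smp_mb();
                data->wait = 0;
        }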

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-03 Thread Nick Piggin
On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> 
> Second experiment which we did was migrating the IO submission to the
> IO completion cpu. Instead of submitting the IO on the same cpu where the
> request arrived, in this experiment  the IO submission gets migrated to the
> cpu that is processing IO completions(interrupt). This will minimize the
> access to remote cachelines (that happens in timers, slab, scsi layers). The
> IO submission request is forwarded to the kblockd thread on the cpu receiving
> the interrupts. As part of this, we also made kblockd thread on each cpu as 
> the
> highest priority thread, so that IO gets submitted as soon as possible on the
> interrupt cpu with out any delay. On x86_64 SMP platform with 16 cores, this
> resulted in 2% performance improvement and 3.3% improvement on two node ia64
> platform.
> 
> Quick and dirty prototype patch (not meant for inclusion) for this io migration
> experiment is appended to this e-mail.
> 
> Observation #1 mentioned above is also applicable to this experiment. CPUs
> processing interrupts will now have to cater for IO submission/processing
> load as well.
> 
> Observation #2: This introduces some migration overhead during IO submission.
> With the current prototype, every incoming IO request results in an IPI and
> context switch(to kblockd thread) on the interrupt processing cpu.
> This issue needs to be addressed and main challenge to address is
> the efficient mechanism of doing this IO migration(how much batching to do and
> when to send the migrate request?), so that we don't delay the IO much and at
> the same point, don't cause much overhead during migration.

Hi guys,

Just had another way we might do this. Migrate the completions out to
the submitting CPUs rather than migrate submission into the completing
CPU.

I've got a basic patch that passes some stress testing. It seems fairly
simple to do at the block layer, and the bulk of the patch involves
introducing a scalable smp_call_function for it.

Now it could be optimised more by looking at batching up IPIs or
optimising the call function path or even migrating the completion event
at a different level...

However, this is a first cut. It actually seems like it might be taking
slightly more CPU to process block IO (~0.2%)... however, this is on my
dual core system that shares an llc, which means that there are very few
cache benefits to the migration, but non-zero overhead. So on multisocket
systems hopefully it might get to positive territory.

---

Index: linux-2.6/arch/x86/kernel/smp_64.c
===
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -321,6 +321,99 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
 }
 
+struct call_single_data {
+   struct list_head list;
+   void (*func) (void *info);
+   void *info;
+   int wait;
+};
+
+struct call_single_queue {
+   spinlock_t lock;
+   struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
+
+int __cpuinit init_smp_call(void)
+{
+   int i;
+
+   for_each_cpu_mask(i, cpu_possible_map) {
+   spin_lock_init(&per_cpu(call_single_queue, i).lock);
+   INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+   }
+   return 0;
+}
+core_initcall(init_smp_call);
+
+/*
+ * this function sends a 'generic call function' IPI to all other CPU
+ * of the system defined in the mask.
+ */
+int smp_call_function_fast(int cpu, void (*func)(void *), void *info,
+   int wait)
+{
+   struct call_single_data *data;
+   struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
+   cpumask_t mask = cpumask_of_cpu(cpu);
+   int ipi;
+
+   data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+   data->func = func;
+   data->info = info;
+   data->wait = wait;
+
+   spin_lock_irq(&dst->lock);
+   ipi = list_empty(&dst->list);
+   list_add_tail(&data->list, &dst->list);
+   spin_unlock_irq(&dst->lock);
+
+   if (ipi)
+   send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+   if (wait) {
+   /* Wait for response */
+   while (data->wait)
+   cpu_relax();
+   kfree(data);
+   }
+
+   return 0;
+}
+
+asmlinkage void smp_call_function_fast_interrupt(void)
+{
+   struct call_single_queue *q;
+   unsigned long flags;
+   LIST_HEAD(list);
+
+   ack_APIC_irq();
+
+   q = &__get_cpu_var(call_single_queue);
+   spin_lock_irqsave(&q->lock, flags);
+   list_replace_init(&q->list, &list);
+   spin_unlock_irqrestore(&q->lock, flags);
+
+   exit_idle();
+   irq_enter();
+   while (!list_empty(&list)) {
+   struct call_single_data *data;
+
+   data = list_entry(list.next, struct call_single_data, list);

Re: [PATCH 3/3] uio: vm_operations_struct ->nopage to ->fault method conversion

2008-02-02 Thread Nick Piggin
On Saturday 02 February 2008 20:51, Denis Cheng wrote:
> Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>

Thanks, but already patched in -mm.

> ---
>  drivers/uio/uio.c |   19 ---
>  1 files changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
> index cc246fa..47e0c32 100644
> --- a/drivers/uio/uio.c
> +++ b/drivers/uio/uio.c
> @@ -417,30 +417,27 @@ static void uio_vma_close(struct vm_area_struct *vma)
>   idev->vma_count--;
>  }
>
> -static struct page *uio_vma_nopage(struct vm_area_struct *vma,
> -unsigned long address, int *type)
> +static int uio_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>   struct uio_device *idev = vma->vm_private_data;
> - struct page* page = NOPAGE_SIGBUS;
>
>   int mi = uio_find_mem_index(vma);
>   if (mi < 0)
> - return page;
> + return VM_FAULT_SIGBUS;
>
>   if (idev->info->mem[mi].memtype == UIO_MEM_LOGICAL)
> - page = virt_to_page(idev->info->mem[mi].addr);
> + vmf->page = virt_to_page(idev->info->mem[mi].addr);
>   else
> - page = vmalloc_to_page((void*)idev->info->mem[mi].addr);
> - get_page(page);
> - if (type)
> - *type = VM_FAULT_MINOR;
> - return page;
> + vmf->page = vmalloc_to_page((void *)idev->info->mem[mi].addr);
> + get_page(vmf->page);
> +
> + return 0;
>  }
>
>  static struct vm_operations_struct uio_vm_ops = {
>   .open = uio_vma_open,
>   .close = uio_vma_close,
> - .nopage = uio_vma_nopage,
> + .fault = uio_vma_fault,
>  };
>
>  static int uio_mmap_physical(struct vm_area_struct *vma)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Feature Removals for 2.6.25

2008-02-01 Thread Nick Piggin
On Thu, Jan 31, 2008 at 05:38:42PM -0800, Harvey Harrison wrote:
> ---
> Ping?
> What: vm_ops.nopage
> When: Soon, provided in-kernel callers have been converted
> Why:  This interface is replaced by vm_ops.fault, but it has been around
>   forever, is used by a lot of drivers, and doesn't cost much to
>       maintain.
> Who:  Nick Piggin <[EMAIL PROTECTED]>

Well the in-kernel callers have not all been converted yet. I have
actually done the work, but it needs testing and merging by maintainers.
Getting it done during this merge window would be nice; I'm going to
try to make that happen after I get back from LCA. Otherwise probably
2.6.26.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [bug] as_merged_requests(): possible recursive locking detected

2008-02-01 Thread Nick Piggin
On Friday 01 February 2008 21:31, Jens Axboe wrote:
> On Fri, Feb 01 2008, Jens Axboe wrote:

> > I think the right solution is to remove swap_io_context() and fix the io
> > context referencing in as-iosched.c instead.
>
> IOW, the below. I don't know why Nick originally wanted to swap io
> contexts for a rq <-> rq merge, there seems little (if any) benefit to
> doing so.

Yeah, I guess this patch is fine. Simpler is better.

>
> diff --git a/block/as-iosched.c b/block/as-iosched.c
> index 9603684..852803e 100644
> --- a/block/as-iosched.c
> +++ b/block/as-iosched.c
> @@ -1266,22 +1266,8 @@ static void as_merged_requests(struct request_queue
> *q, struct request *req, */
>   if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
>   if (time_before(rq_fifo_time(next), rq_fifo_time(req))) {
> - struct io_context *rioc = RQ_IOC(req);
> - struct io_context *nioc = RQ_IOC(next);
> -
>   list_move(&req->queuelist, &next->queuelist);
>   rq_set_fifo_time(req, rq_fifo_time(next));
> - /*
> -  * Don't copy here but swap, because when anext is
> -  * removed below, it must contain the unused context
> -  */
> - if (rioc != nioc) {
> - double_spin_lock(&rioc->lock, &nioc->lock,
> - rioc < nioc);
> - swap_io_context(&rioc, &nioc);
> - double_spin_unlock(&rioc->lock, &nioc->lock,
> - rioc < nioc);
> - }
>   }
>   }
>
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 6d16755..80245dc 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -176,15 +176,6 @@ void copy_io_context(struct io_context **pdst, struct
> io_context **psrc) }
>  EXPORT_SYMBOL(copy_io_context);
>
> -void swap_io_context(struct io_context **ioc1, struct io_context **ioc2)
> -{
> - struct io_context *temp;
> - temp = *ioc1;
> - *ioc1 = *ioc2;
> - *ioc2 = temp;
> -}
> -EXPORT_SYMBOL(swap_io_context);
> -
>  int __init blk_ioc_init(void)
>  {
>   iocontext_cachep = kmem_cache_create("blkdev_ioc",
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index baba233..bbe3cf4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -39,7 +39,6 @@ void exit_io_context(void);
>  struct io_context *get_io_context(gfp_t gfp_flags, int node);
>  struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
>  void copy_io_context(struct io_context **pdst, struct io_context **psrc);
> -void swap_io_context(struct io_context **ioc1, struct io_context **ioc2);
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-02-01 Thread Nick Piggin
On Friday 01 February 2008 09:45, Frederik Himpe wrote:
> On ma, 2008-01-28 at 12:46 +1100, Nick Piggin wrote:
> > On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > > On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > > > With Linux 2.6.24-rc8 I often have the problem that the pan
> > > > > > > usenet reader starts using 100% of CPU time after some time.
> > > > > > > When this happens, kill -9 does not work, and strace just hangs
> > > > > > > when trying to attach to the process. The same with gdb. ps
> > > > > > > shows the process as being in the R state.
> >
> > Well after trying a lot of writev combinations, I've reproduced a hang
> > *hangs head*.
> >
> > Does this help?
>
> Just to confirm: in four days of testing, I haven't seen the problem
> anymore, so it looks like this was indeed the right fix.

Thanks very much for reporting and testing. This patch needs to go
into 2.6.24.stable and upstream.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] mm: fix PageUptodate data race

2008-01-31 Thread Nick Piggin
Sorry, way behind on email here. I'll get through it slowly...

On Sat, Jan 26, 2008 at 10:03:56PM -0800, Andrew Morton wrote:
> > On Tue, 22 Jan 2008 05:01:14 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > After running SetPageUptodate, preceeding stores to the page contents to
> > actually bring it uptodate may not be ordered with the store to set the page
> > uptodate.
> > 
> > Therefore, another CPU which checks PageUptodate is true, then reads the
> > page contents can get stale data.
> > 
> > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > PageUptodate.
> > 
> > Many places that test PageUptodate, do so with the page locked, and this
> > would be enough to ensure memory ordering in those places if SetPageUptodate
> > were only called while the page is locked. Unfortunately that is not always
> > the case for some filesystems, but it could be an idea for the future.
> > 
> > Also bring the handling of anonymous page uptodateness in line with that of
> > file backed page management, by marking anon pages as uptodate when they
> > _are_ uptodate, rather than when our implementation requires that they be
> > marked as such. Doing so allows us to get rid of the smp_wmb's in the page
> > copying functions, which were especially added for anonymous pages for an
> > analogous memory ordering problem. Both file and anonymous pages are handled
> > with the same barriers.
> > 
> 
> So...  it's two patches in one.

I guess so. Hmm, at least I appreciate it (them) getting testing in -mm
for now. I guess I should break it in two; do you agree, Hugh? Do you
like/dislike the anonymous page change?


> What kernel is this against?  Looks like mainline.  Is it complete and
> correct when applied against the large number of pending MM changes?

Uh, I forget. But luckily this one should be quite correct regardless
of pending mm changes... unless something there has fundamentally changed
the semantics or locking of PG_uptodate... which wouldn't be too surprising
actually ;)

No, it should be OK. I'll double check when I look at resubmitting it as
2 patches.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: x86 arch updates also broke s390

2008-01-31 Thread Nick Piggin
On Thu, Jan 31, 2008 at 11:24:54AM +0100, Ingo Molnar wrote:
> 
> * Martin Schwidefsky <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, 2008-01-31 at 02:33 +0200, Adrian Bunk wrote:
> > > <--  snip  -->
> > > 
> > > ...
> > >   CC  arch/s390/kernel/asm-offsets.s
> > > In file included from 
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/arch/s390/kernel/asm-offsets.c:7:
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/sched.h: In
> > > function 'spin_needbreak':
> > > /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/sched.h:1931:
> > > error: implicit declaration of function '__raw_spin_is_contended'
> > > make[2]: *** [arch/s390/kernel/asm-offsets.s] Error 1
> > > 
> > > <--  snip  -->
> > 
> > Defining GENERIC_LOCKBREAK in arch/s390/Kconfig takes care of it. I'll 
> > cook up a patch and queue it in git390.
> 
> thanks!

Yeah thanks, don't know what happened with this, sorry. I thought I
had defined it for all SMP capable ones, so maybe it was a quilt
error or something on my part.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [0/18] Implement some low hanging BKL removal fruit in fs/*

2008-01-27 Thread Nick Piggin
On Sunday 27 January 2008 13:17, Andi Kleen wrote:
> [Andrew: I believe this is -mm material for .25]
>
> - Convert some more file systems (generally those who don't use the BKL
> for anything except mount) to use unlocked_bkl.
> - Implement BKL less fasync (see patch for the rationale)
> This is currently a separate entry point, but since the number of fasync
> users in the tree is relatively small I hope the older entry point can
> be removed then in the not too far future
> [help from other people converting more fasync users to unlocked_fasync
> would be appreciated]
> - Implement BKL less remote_llseek
> - While I was at it I also added a few missing compat ioctl handlers
> - Fix a few comments
>
> This fixes a lot of relatively trivial BKL users in fs/*. The main
> remaining non legacy offenders are now locks.c, nfs/nfsd and reiserfs.
> I believe BKL removal for all of those is being worked on by other people.
> Also a lot of "legacy" file systems still use it, but converting those
> does not seem to be very pressing.

BTW, here is a patch I did a while back for minix. I know it isn't
a big deal, but the work is done so I guess I should send it along.
The minix filesystem uses the BKL to protect access to metadata. Switch
to a per-superblock mutex.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/fs/minix/bitmap.c
===
--- linux-2.6.orig/fs/minix/bitmap.c
+++ linux-2.6/fs/minix/bitmap.c
@@ -69,11 +69,11 @@ void minix_free_block(struct inode *inod
 		return;
 	}
 	bh = sbi->s_zmap[zone];
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	if (!minix_test_and_clear_bit(bit, bh->b_data))
 		printk("minix_free_block (%s:%lu): bit already cleared\n",
 		   sb->s_id, block);
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 	return;
 }
@@ -88,18 +88,18 @@ int minix_new_block(struct inode * inode
 		struct buffer_head *bh = sbi->s_zmap[i];
 		int j;
 
-		lock_kernel();
+		mutex_lock(&sbi->s_mutex);
 		j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
 		if (j < bits_per_zone) {
 			minix_set_bit(j, bh->b_data);
-			unlock_kernel();
+			mutex_unlock(&sbi->s_mutex);
 			mark_buffer_dirty(bh);
 			j += i * bits_per_zone + sbi->s_firstdatazone-1;
 			if (j < sbi->s_firstdatazone || j >= sbi->s_nzones)
 break;
 			return j;
 		}
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 	}
 	return 0;
 }
@@ -211,10 +211,10 @@ void minix_free_inode(struct inode * ino
 	minix_clear_inode(inode);	/* clear on-disk copy */
 
 	bh = sbi->s_imap[ino];
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	if (!minix_test_and_clear_bit(bit, bh->b_data))
 		printk("minix_free_inode: bit %lu already cleared\n", bit);
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
  out:
 	clear_inode(inode);		/* clear in-memory copy */
@@ -237,7 +237,7 @@ struct inode * minix_new_inode(const str
 	j = bits_per_zone;
 	bh = NULL;
 	*error = -ENOSPC;
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	for (i = 0; i < sbi->s_imap_blocks; i++) {
 		bh = sbi->s_imap[i];
 		j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
@@ -245,17 +245,17 @@ struct inode * minix_new_inode(const str
 			break;
 	}
 	if (!bh || j >= bits_per_zone) {
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 		iput(inode);
 		return NULL;
 	}
 	if (minix_test_and_set_bit(j, bh->b_data)) {	/* shouldn't happen */
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 		printk("minix_new_inode: bit already set\n");
 		iput(inode);
 		return NULL;
 	}
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 	j += i * bits_per_zone;
 	if (!j || j > sbi->s_ninodes) {
Index: linux-2.6/fs/minix/dir.c
===
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -102,7 +102,7 @@ static int minix_readdir(struct file * f
 	char *name;
 	__u32 inumber;
 
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 
 	pos = (pos + chunk_size-1) & ~(chunk_size-1);
 	if (pos >= inode->i_size)
@@ -146,7 +146,7 @@ static int minix_readdir(struct file * f
 
 done:
 	filp->f_pos = (n << PAGE_CACHE_SHIFT) | offset;
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	return 0;
 }
 
Index: linux-2.6/fs/minix/inode.c
===
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -174,6 +174,7 @@ static int minix_fill_super(struct super
 	sbi->s_firstdatazone = ms->s_firstdatazone;
 	sbi->s_log_zone_size = ms->s_log_zone_size;
 	sbi->s_max_size = ms->s_max_size;
+	mutex_init(&sbi->s_mutex);
 	s->

Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-01-27 Thread Nick Piggin
On Sunday 27 January 2008 01:27, Pascal Terjan wrote:
> Nick Piggin  yahoo.com.au> writes:
> > On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > > I just succeeded to reproduce the problem with this patch. Does this
> > > smell like an XFS problem?
>
> I got the same issue using ext3
>
> > Possible. Though I think it is more likely to be a bug in the
> > new deadlock avoidance code in the generic buffered write path.
> > Dang... I wonder why this hasn't come up earlier. It looks like
> > pan's use of writev might be tickling it.
> >
> > How quickly can you reproduce this?
>
> When I was using pan daily one month ago, I got it twice over a week
>
> > Can you use strace to see what the hanging syscall looks like?
>
> I tried last week during 5 hours without luck, I can try again

Dang, I didn't see any reports of this earlier :(

Anyway, I sent a patch to fix it in the original thread (can you
reply-to-all please? it's just a bit easier to keep threads together)

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-01-27 Thread Nick Piggin
On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > > > reader starts using 100% of CPU time after some time. When this
> > > > > happens, kill -9 does not work, and strace just hangs when trying
> > > > > to attach to the process. The same with gdb. ps shows the process
> > > > > as being in the R state.
> > > > >
> > > > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > > > Jan 21 21:45:01 Anastacia kernel: pan   R  running task
> > > > > 0
> >
> > Nasty. The attached patch is something really simple that can sometimes
> > help. sysrq+p is also an option, if you're on a UP system.
> >
> > Any luck getting traces?
>
> I just succeeded to reproduce the problem with this patch. Does this
> smell like an XFS problem?
>
> Jan 26 14:17:43 Anastacia kernel: pan   R  running task0 
> 7564  1 Jan 26 14:17:43 Anastacia kernel:  3f5b3248
> 1000 880c28b0  Jan 26 14:17:43
> Anastacia kernel:  81003f5b3248 81002d1ed900 2d1ed900
>  Jan 26 14:17:43 Anastacia kernel:  810016050dd0
> f000f000  81002d1eda10 Jan 26 14:17:43
> Anastacia kernel: Call Trace:
> Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720]
> :xfs:xfs_get_blocks+0x0/0x10 Jan 26 14:17:43 Anastacia kernel: 
> [unix_poll+0/176] unix_poll+0x0/0xb0 Jan 26 14:17:43 Anastacia kernel: 
> [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10 Jan 26 14:17:43
> Anastacia kernel:  [iov_iter_copy_from_user_atomic+65/160]
> iov_iter_copy_from_user_atomic+0x41/0xa0 Jan 26 14:17:43 Anastacia kernel: 
> [iov_iter_copy_from_user_atomic+46/160]
> iov_iter_copy_from_user_atomic+0x2e/0xa0 Jan 26 14:17:43 Anastacia kernel: 
> [generic_file_buffered_write+383/1728]

Well after trying a lot of writev combinations, I've reproduced a hang
*hangs head*.

Does this help?
Zero-length iovecs can send writev into an infinite loop, because the
iovec iterator does not always advance over them.

The sequence required to trigger this is not trivial. I think it requires
that a zero-length iovec be followed by a non-zero-length iovec which causes
a pagefault in the atomic usercopy. This causes the writev code to drop back
into single-segment copy mode, which then tries to copy the 0 bytes of the
zero-length iovec; a zero length copy looks like a failure though, so it
loops.
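
(Roughly the shape of a trigger from userspace; a hypothetical sketch
only, since whether the second segment actually faults in the atomic
copy depends on the mapping not having been touched yet:)

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/uio.h>

int main(void)
{
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char *lazy = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char c = 0;
        struct iovec iov[2] = {
                { .iov_base = &c,   .iov_len = 0    },  /* zero-length */
                { .iov_base = lazy, .iov_len = 4096 },  /* copy may fault */
        };

        return writev(fd, iov, 2) < 0;
}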

Put a test into iov_iter_advance to catch zero-length iovecs. We could just
put the test in the fallback path, but I feel it is more robust to skip
over zero-length iovecs throughout the code (iovec iterator may be used in
filesystems too, so it should be robust).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1733,7 +1733,11 @@ static void __iov_iter_advance_iov(struc
 		const struct iovec *iov = i->iov;
 		size_t base = i->iov_offset;
 
-		while (bytes) {
+		/*
+		 * The !iov->iov_len check ensures we skip over unlikely
+		 * zero-length segments.
+		 */
+		while (bytes || !iov->iov_len) {
 			int copy = min(bytes, iov->iov_len - base);
 
 			bytes -= copy;
@@ -2251,6 +2255,7 @@ again:
 
 		cond_resched();
 
+		iov_iter_advance(i, copied);
 		if (unlikely(copied == 0)) {
 			/*
 			 * If we were unable to copy any data at all, we must
@@ -2264,7 +2269,6 @@ again:
 		iov_iter_single_seg_count(i));
 			goto again;
 		}
-		iov_iter_advance(i, copied);
 		pos += copied;
 		written += copied;
 


Re: [RFC] some page can't be migrated

2008-01-27 Thread Nick Piggin
On Sunday 27 January 2008 17:03, Andrew Morton wrote:
> > On Fri, 25 Jan 2008 14:03:25 +0800 Shaohua Li <[EMAIL PROTECTED]>
> > wrote:
> >
> > -   if (!page->mapping)
> > +   if (!page->mapping) {
> > +   if (!PageAnon(page) && PagePrivate(page))
> > +   try_to_release_page(page, GFP_KERNEL);
> > goto rcu_unlock;
> > +   }
>
> We call something(GFP_KERNEL) under rcu_read_lock()?  I've lost track of
> the myriad flavours of rcu which we purport to support, but I don't think
> they'll all like us blocking under rcu_read_lock().
>
> We _won't_ block, because try_to_release_page() will see the NULL ->mapping
> and will call the non-blocking try_to_free_buffers().  But still, it looks
> bad, and will cause problems if someone decides to add a might_sleep_if()
> to try_to_release_page().
>
> So...  I'd suggest that it would be better to add an apologetic comment and
> call direct into try_to_free_buffers().

You're right, but can't we just rcu_read_unlock() before try_to_release_page?
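
ie. something along these lines (sketch only; assumes an "out" exit
label past the rcu_unlock one, since the lock is already dropped):

        if (!page->mapping) {
                if (!PageAnon(page) && PagePrivate(page)) {
                        /* drop rcu_read_lock before a call that can block */
                        rcu_read_unlock();
                        try_to_release_page(page, GFP_KERNEL);
                        goto out;
                }
                goto rcu_unlock;
        }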
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-01-26 Thread Nick Piggin
On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > > > reader starts using 100% of CPU time after some time. When this
> > > > > happens, kill -9 does not work, and strace just hangs when trying
> > > > > to attach to the process. The same with gdb. ps shows the process
> > > > > as being in the R state.
> > > > >
> > > > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > > > Jan 21 21:45:01 Anastacia kernel: pan   R  running task
> > > > > 0
> >
> > Nasty. The attached patch is something really simple that can sometimes
> > help. sysrq+p is also an option, if you're on a UP system.
> >
> > Any luck getting traces?
>
> I just succeeded to reproduce the problem with this patch. Does this
> smell like an XFS problem?

Possible. Though I think it is more likely to be a bug in the
new deadlock avoidance code in the generic buffered write path.
Dang... I wonder why this hasn't come up earlier. It looks like
pan's use of writev might be tickling it.

How quickly can you reproduce this?

Can you use strace to see what the hanging syscall looks like?

Thanks,
Nick


> Jan 26 14:17:43 Anastacia kernel: pan   R  running task0 
> 7564  1 Jan 26 14:17:43 Anastacia kernel:  3f5b3248
> 1000 880c28b0  Jan 26 14:17:43
> Anastacia kernel:  81003f5b3248 81002d1ed900 2d1ed900
>  Jan 26 14:17:43 Anastacia kernel:  810016050dd0
> f000f000  81002d1eda10 Jan 26 14:17:43
> Anastacia kernel: Call Trace:
> Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720]
> :xfs:xfs_get_blocks+0x0/0x10 Jan 26 14:17:43 Anastacia kernel: 
> [unix_poll+0/176] unix_poll+0x0/0xb0 Jan 26 14:17:43 Anastacia kernel: 
> [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10 Jan 26 14:17:43
> Anastacia kernel:  [iov_iter_copy_from_user_atomic+65/160]
> iov_iter_copy_from_user_atomic+0x41/0xa0 Jan 26 14:17:43 Anastacia kernel: 
> [iov_iter_copy_from_user_atomic+46/160]
> iov_iter_copy_from_user_atomic+0x2e/0xa0 Jan 26 14:17:43 Anastacia kernel: 
> [generic_file_buffered_write+383/1728]
> generic_file_buffered_write+0x17f/0x6c0 Jan 26 14:17:43 Anastacia kernel: 
> [current_fs_time+30/48] current_fs_time+0x1e/0x30 Jan 26 14:17:43 Anastacia
> kernel:  [_end+127997742/2129947720] :xfs:xfs_write+0x676/0x910 Jan 26
> 14:17:43 Anastacia kernel:  [find_lock_page+61/192]
> find_lock_page+0x3d/0xc0 Jan 26 14:17:43 Anastacia kernel: 
> [_end+127981080/2129947720] :xfs:xfs_file_aio_write+0x0/0x50 Jan 26
> 14:17:43 Anastacia kernel:  [do_sync_readv_writev+203/272]
> do_sync_readv_writev+0xcb/0x110 Jan 26 14:17:43 Anastacia kernel: 
> [__do_fault+501/1056] __do_fault+0x1f5/0x420 Jan 26 14:17:43 Anastacia
> kernel:  [autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
> Jan 26 14:17:43 Anastacia kernel:  [handle_mm_fault+1344/2048]
> handle_mm_fault+0x540/0x800 Jan 26 14:17:43 Anastacia kernel: 
> [rw_copy_check_uvector+157/336] rw_copy_check_uvector+0x9d/0x150 Jan 26
> 14:17:43 Anastacia kernel:  [do_readv_writev+253/560]
> do_readv_writev+0xfd/0x230 Jan 26 14:17:43 Anastacia kernel: 
> [sys_writev+83/144] sys_writev+0x53/0x90 Jan 26 14:17:43 Anastacia kernel: 
> [system_call+126/131] system_call+0x7e/0x83 Jan 26 14:17:43 Anastacia
> kernel:


> SysRq : Show Regs
> CPU 0:
> Modules linked in: usb_storage af_packet nvidia(P) vboxdrv ipv6 fuse
> snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq cpufreq_ondemand
> video output tc1100_wmi sbs sbshc container dock battery ac binfmt_misc
> loop ext3 jbd dm_mirror sr_mod dm_mod pata_amd ata_generic sata_sil
> usbmouse usbhid ff_memless floppy usblp powernow_k8 freq_table
> snd_pcm_oss snd_mixer_oss snd_mpu401 snd_mpu401_uart snd_rawmidi ns558
> gameport parport_pc snd_seq_device parport rtc_cmos pcspkr snd_intel8x0
> k8temp snd_ac97_codec ohci1394 ac97_bus ieee1394 snd_pcm snd_timer skge
> ohci_hcd ehci_hcd snd soundcore usbcore forcedeth snd_page_alloc ssb fan
> pcmcia pcmcia_core i2c_nforce2 i2c_core button thermal processor sg
> evdev genrtc xfs scsi_wait_scan sd_mod sata_nv libata scsi_mod
> Pid: 7564, comm: pan Tainted: P2.6.24-desktop-0.rc8.2.1mdv #1
> RIP: 0010:[]  [] block_write_begin
> +0x87/0xe0
> RSP: 0018:81002e9b5ac8  EFLAGS: 0286
> RAX: 81003f5b3248 RBX: fff4 RCX: 
> RDX: 81003f5b3248 RSI:  RDI: 81002d1eda

Re: Unpredictable performance

2008-01-25 Thread Nick Piggin
On Saturday 26 January 2008 02:03, Asbjørn Sannes wrote:
> Asbjørn Sannes wrote:
> > Nick Piggin wrote:
> >> On Friday 25 January 2008 22:32, Asbjorn Sannes wrote:
> >>> Hi,
> >>>
> >>> I am experiencing unpredictable results with the following test
> >>> without other processes running (exception is udev, I believe):
> >>> cd /usr/src/test
> >>> tar -jxf ../linux-2.6.22.12
> >>> cp ../working-config linux-2.6.22.12/.config
> >>> cd linux-2.6.22.12
> >>> make oldconfig
> >>> time make -j3 > /dev/null # This is what I note down as a "test" result
> >>> cd /usr/src ; umount /usr/src/test ; mkfs.ext3 /dev/cc/test
> >>> and then reboot
> >>>
> >>> The kernel is booted with the parameter mem=8192
> >>>
> >>> For 2.6.23.14 the results vary from (real time) 33m30.551s to
> >>> 45m32.703s (30 runs)
> >>> For 2.6.23.14 with nop i/o scheduler from 29m8.827s to 55m36.744s (24
> >>> runs) For 2.6.22.14 also varied a lot.. but, lost results :(
> >>> For 2.6.20.21 only vary from 34m32.054s to 38m1.928s (10 runs)
> >>>
> >>> Any idea of what can cause this? I have tried to make the runs as equal
> >>> as possible, rebooting between each run.. i/o scheduler is cfq as
> >>> default.
> >>>
> >>> sys and user time only varies a couple of seconds.. and the order of
> >>> when it is "fast" and when it is "slow" is completly random, but it
> >>> seems that the results are mostly concentrated around the mean.
> >>
> >> Hmm, lots of things could cause it. With such big variations in
> >> elapsed time, and small variations on CPU time, I guess the fs/IO
> >> layers are the prime suspects, although it could also involve the
> >> VM if you are doing a fair amount of page reclaim.
> >>
> >> Can you boot with enough memory such that it never enters page
> >> reclaim? `grep scan /proc/vmstat` to check.
> >>
> >> Otherwise you could mount the working directory as tmpfs to
> >> eliminate IO.
> >>
> >> bisecting it down to a single patch would be really helpful if you
> >> can spare the time.
> >
> > I'm going to run some tests without limiting the memory to 80 megabytes
> > (so that it is 2 gigabyte) and see how much it varies then, but iff I
> > recall correctly it did not vary much. I'll reply to this e-mail with
> > the results.
>
> 5 runs gives me:
> real5m58.626s
> real5m57.280s
> real5m56.584s
> real5m57.565s
> real5m56.613s
>
> Should I test with tmpfs as well?

I wouldn't worry about it. It seems like it might be due to page reclaim
(fs / IO can't be ruled out completely though). Hmm, I haven't been following
reclaim so closely lately; you say it started going bad around 2.6.22? It
may be lumpy reclaim patches?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Unpredictable performance

2008-01-25 Thread Nick Piggin
On Friday 25 January 2008 22:32, Asbjorn Sannes wrote:
> Hi,
>
> I am experiencing unpredictable results with the following test
> without other processes running (exception is udev, I believe):
> cd /usr/src/test
> tar -jxf ../linux-2.6.22.12
> cp ../working-config linux-2.6.22.12/.config
> cd linux-2.6.22.12
> make oldconfig
> time make -j3 > /dev/null # This is what I note down as a "test" result
> cd /usr/src ; umount /usr/src/test ; mkfs.ext3 /dev/cc/test
> and then reboot
>
> The kernel is booted with the parameter mem=8192
>
> For 2.6.23.14 the results vary from (real time) 33m30.551s to 45m32.703s
> (30 runs)
> For 2.6.23.14 with nop i/o scheduler from 29m8.827s to 55m36.744s (24 runs)
> For 2.6.22.14 also varied a lot.. but, lost results :(
> For 2.6.20.21 only vary from 34m32.054s to 38m1.928s (10 runs)
>
> Any idea of what can cause this? I have tried to make the runs as equal
> as possible, rebooting between each run.. i/o scheduler is cfq as default.
>
> sys and user time only varies a couple of seconds.. and the order of
> when it is "fast" and when it is "slow" is completely random, but it
> seems that the results are mostly concentrated around the mean.

Hmm, lots of things could cause it. With such big variations in
elapsed time, and small variations on CPU time, I guess the fs/IO
layers are the prime suspects, although it could also involve the
VM if you are doing a fair amount of page reclaim.

Can you boot with enough memory such that it never enters page
reclaim? `grep scan /proc/vmstat` to check.

Otherwise you could mount the working directory as tmpfs to
eliminate IO.

bisecting it down to a single patch would be really helpful if you
can spare the time.

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH UPDATE] x86: ignore spurious faults

2008-01-25 Thread Nick Piggin
On Friday 25 January 2008 19:15, Jan Beulich wrote:
> Actually, another thought: permitting (and handling) spurious faults for
> kernel mappings conflicts with NMI handling, i.e. great care would be
> needed to ensure the NMI path cannot touch any such mapping. So
> even the present Xen/Linux Dom0 implementation may have some
> (perhaps unlikely) problems here, and it would get worse if we added
> e.g. a virtual watchdog NMI (something I am considering, which would
> then extend the problem to DomU-s).

Can you explain how they conflict?

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] Minimal fix for private_list handling races

2008-01-25 Thread Nick Piggin
On Thursday 24 January 2008 02:48, Jan Kara wrote:
> On Thu 24-01-08 02:05:16, Nick Piggin wrote:
> > On Thursday 24 January 2008 00:30, Jan Kara wrote:
> > > On Wed 23-01-08 12:00:02, Nick Piggin wrote:
> > > > On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> > > > >   Hi,
> > > > >
> > > > >   as I got no answer for a week, I'm resending this fix for races
> > > > > in private_list handling. Andrew, do you like them more than the
> > > > > previous version?
> > > >
> > > > FWIW, I reviewed this, and it looks OK although I think some comments
> > > > would be in order.
> > >
> > >   Thanks.
> > >
> > > > What would be really nice is to avoid the use of b_assoc_buffers
> > > > completely in this function like I've attempted (untested). I don't
> > > > know if you'd actually call that an improvement...?
> > >
> > >   I thought about this solution as well. But main issue I had with this
> > > solution is that currently, you nicely submit all the metadata buffers
> > > at once, so that block layer can sort them and write them in nice
> > > order. With the array you submit buffers by 16 (or any other fixed
> > > amount) and in mostly random order... So IMHO fsync would become
> > > measurably slower.
> >
> > Oh, I don't know the filesystems very well... which ones would
> > attach a large number of metadata buffers to the inode?
>
>   This logic is actually used only by a few filesystems - ext2 and UDF are
> probably the most common ones. For example for ext2, the indirect blocks
> are on the list if the file is freshly written, so that is roughly around
> 1MB of metadata per 1GB of data (for 4KB blocks, with 1KB blocks it is 4MB
> per 1GB). Because seeks are expensive, you could really end up with the
> write being 16 times slower when you do it in 16 passes instead of one...

Yeah, that's fair enough I suppose. I wasn't thinking you'd have a
huge newly dirtied file, but it could happen. I don't want to cause
regressions.


> > With that in mind, doesn't your first patch suffer from a race due to
> > exactly this unlocked list_empty check when you are removing clean
> > buffers from the queue?
> >
> >  CPU1 (fsync_buffers_list)              CPU2 (mark_buffer_dirty_inode)
> >
> >  if (!buffer_dirty(bh) && !buffer_locked(bh))
> >                                          mark_buffer_dirty()
> >                                          if (list_empty(&bh->b_assoc_buffers))
> >                                                  /* add */
> >  __remove_assoc_queue(bh);
> >
> > Which results in the buffer being dirty but not on the ->private_list,
> > doesn't it?
>
>   Hmm, I'm not sure about which patch you speak. Logic with removing clean
> buffers has been in the first version (but there mark_buffer_dirty_inode()
> was written differently).

Ah, yes I see I missed that. I like that a lot better.


> In the current version, we readd buffer to 
> private_list if it is found dirty in the second while loop of
> fsync_buffers() and that should be enough.

Sure, I think there is still a data race though, but if there is one
it's already been there for a long time and nobody cares too much about
those anyway.


> > But let's see... there must be a memory ordering problem here in existing
> > code anyway, because I don't see any barriers. Between b_assoc_buffers
> > and b_state (via buffer_dirty); fsync_buffers_list vs
> > mark_buffer_dirty_inode, right?
>
>   I'm not sure. What exactly to you mean? BTW: spin_lock is a memory
> barrier, isn't it?

In existing code:

mark_buffer_dirty_inode():               fsync_buffers_list():
 test_set_buffer_dirty(bh);               list_del_init(&bh->b_assoc_buffers);
 if (list_empty(&bh->b_assoc_buffers))    if (buffer_dirty(bh)) {
  ...                                      list_add(&bh->b_assoc_buffers, ...);

These two code sequences can run concurrently because only fsync_buffers_list
takes the lock.

So fsync_buffers_list can speculatively load bh->b_state before
its stores to clear b_assoc_buffers propagate to the CPU running
mark_buffer_dirty_inode.

So if there is a !dirty buffer on the list, then fsync_buffers_list will
remove it from the list, but mark_buffer_dirty_inode won't see it has been
removed from the list and won't re-add it. I think.

This is actually even possible to hit on x86, because x86 reorders loads
past earlier stores. It needs an smp_mb() before the if (buffer_dirty(bh))
test.
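
A minimal sketch of that fix (illustrative only; the list head passed to
list_add() is elided, as in the discussion above):

	list_del_init(&bh->b_assoc_buffers);
	/*
	 * Order the list removal above against the b_state load below.
	 * Pairs with test_set_buffer_dirty() (an atomic op implying a
	 * full barrier) followed by the list_empty() check in
	 * mark_buffer_dirty_inode(). Without this, we can see the buffer
	 * as clean while the other CPU still sees it on the list and
	 * skips re-adding it.
	 */
	smp_mb();
	if (buffer_dirty(bh)) {
		list_add(&bh->b_assoc_buffers, ...);
	}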

Actually I very much dislike testing list entries locklessly, because they
are not trivially atomic operations like single stores... which is another
reason why I like your first patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] some page can't be migrated

2008-01-24 Thread Nick Piggin
On Friday 25 January 2008 14:09, Shaohua Li wrote:
> On Fri, 2008-01-25 at 14:03 +1100, Nick Piggin wrote:
> > On Wednesday 23 January 2008 17:22, Shaohua Li wrote:
> > > Anonymous page might have fs-private metadata, the page is truncated.
> > > As the page has no mapping, page migration refuses to migrate the page.
> > > It appears the page is only freed in page reclaim and if zone watermark
> > > is low, the page is never freed, as a result migration always fails. I
> > > thought we could free the metadata so such page can be freed in
> > > migration and make migration more reliable?
> >
> > Anonymous pages should not have fs-private metadata.
> >
> > Orphaned pages I guess you mean?
>
> yes, maybe, but the comments in truncate_complete_page called the page
> anonymous.

Ah, I see. I think we should use orphaned (or anything except
anonymous) to describe these pages.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] some page can't be migrated

2008-01-24 Thread Nick Piggin
On Friday 25 January 2008 15:01, Christoph Lameter wrote:
> Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> Nick? Ok with you too?

Yeah, for memory hot remove that makes sense. A comment
might be in order, at least a reference to the orphaned
page code in vmscan.c.

Otherwise, it is OK by me.

Acked-by: Nick Piggin <[EMAIL PROTECTED]>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] some page can't be migrated

2008-01-24 Thread Nick Piggin
On Wednesday 23 January 2008 17:22, Shaohua Li wrote:
> Anonymous page might have fs-private metadata, the page is truncated. As
> the page has no mapping, page migration refuses to migrate the page. It
> appears the page is only freed in page reclaim and if zone watermark is
> low, the page is never freed, as a result migration always fails. I
> thought we could free the metadata so such page can be freed in
> migration and make migration more reliable?

Anonymous pages should not have fs-private metadata.

Orphaned pages I guess you mean? They should not be accessible via
the pagecache or the page tables, so how do they keep tangling up
migration? Where/how is migration finding these pages?!


>
> Thanks,
> Shaohua
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6a207e8..6bc38f7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -633,6 +633,17 @@ static int unmap_and_move(new_page_t get_new_page,
> unsigned long private, goto unlock;
>   wait_on_page_writeback(page);
>   }
> +
> + /*
> +  * See truncate_complete_page(). Anonymous page might have
> +  * fs-private metadata, the page is truncated. Such page can't be
> +  * migrated. Try to free metadata, so the page can be freed.
> +  */
> + if (!page->mapping && !PageAnon(page) && PagePrivate(page)) {
> + try_to_release_page(page, GFP_KERNEL);
> + goto unlock;
> + }
> +
>   /*
>* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
>* we cannot notice that anon_vma is freed while we migrates a page.
>
>
> --
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH UPDATE] x86: ignore spurious faults

2008-01-24 Thread Nick Piggin
On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > There's perhaps an opportunity to do this lazy TLB trick in the mmap
> > path as well, where RW mappings are initially mapped as RO so we can
> > catch processes dirtying them and then switched to RW. If the mapping is
> > shared across threads on multiple cores, we can defer synchronizing the
> > TLBs on the others.
>
> I think spurious usermode faults are already dealt with.
> handle_pte_fault() does essentially the same thing as this patch:
>
>   if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
>   update_mmu_cache(vma, address, entry);
>   } else {
>   /*
>* This is needed only for protection faults but the arch code
>* is not yet telling us if this is a protection fault or not.
>* This still avoids useless tlb flushes for .text page faults
>* with threads.
>*/
>   if (write_access)
>   flush_tlb_page(vma, address);
>   }

I (obviously) don't know exactly how the TLB works in x86, but I
thought that on a miss, the CPU walks the pagetables first before
faulting? Maybe that's not the case if there is an RO entry
actually in the TLB?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] lockless get_user_pages for dio (and more)

2008-01-23 Thread Nick Piggin
On Thursday 17 January 2008 06:58, Dave Kleikamp wrote:

> We weren't able to get in any runs before the holidays, but we finally
> have some good news from our performance team:
>
> "To test the effects of the patch, an OLTP workload was run on an IBM
> x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
> 2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
> runs with and without the patch resulted in an overall performance
> benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
> __up_read and __down_read routines that is seen during thread contention
> for system resources was reduced from 2.8% down to .05%. Monitoring
> the /proc/vmstat output from the patched run showed that the counter for
> fast_gup contained a very high number while the fast_gup_slow value was
> zero."

Just for reference, I've attached a more complete patch for x86,
which has to be applied on top of the pte_special patch posted in
another thread.

No need to test anything at this point... the generated code for
this version is actually slightly better than the last one despite
the extra condition being tested for. With a few tweak I was
actually able to reduce the number of tests in the inner loop, and
adding noinline to the leaf functions helps keep them in registers.

I'm currently having a look at an initial powerpc 64 patch,
hopefully we'll see similar improvements there. Will post that when
I get further along with it.

Thanks,
Nick
Introduce a new "fast_gup" (for want of a better name right now) which
is basically a get_user_pages with a less general API that is more suited
to the common case.

- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas

This allows (at least on x86), an optimistic lockless pagetable walk,
without taking any page table locks or even mmap_sem. Page table existence
is guaranteed by turning interrupts off (combined with the fact that we're
always looking up the current mm, which would need an IPI before its
pagetables could be shot down from another CPU).

Many other architectures could do the same thing. Those that don't IPI
could potentially RCU free the page tables and do speculative references
on the pages (a la lockless pagecache) to achieve a lockless fast_gup.


---
Index: linux-2.6/arch/x86/lib/Makefile_64
===
--- linux-2.6.orig/arch/x86/lib/Makefile_64
+++ linux-2.6/arch/x86/lib/Makefile_64
@@ -10,4 +10,4 @@ obj-$(CONFIG_SMP)	+= msr-on-cpu.o
 lib-y := csum-partial_64.o csum-copy_64.o csum-wrappers_64.o delay_64.o \
 	usercopy_64.o getuser_64.o putuser_64.o  \
 	thunk_64.o clear_page_64.o copy_page_64.o bitstr_64.o bitops_64.o
-lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o
+lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o gup.o
Index: linux-2.6/arch/x86/lib/gup.c
===
--- /dev/null
+++ linux-2.6/arch/x86/lib/gup.c
@@ -0,0 +1,189 @@
+/*
+ * Lockless fast_gup for x86
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask, result;
+	pte_t *ptep;
+
+	result = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		result |= _PAGE_RW;
+	mask = result | _PAGE_SPECIAL;
+
+	ptep = pte_offset_map(&pmd, addr);
+	do {
+		/*
+		 * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
+		 * we need to make sure we load the low word first, then the
+		 * high. This means _PAGE_PRESENT should be clear if the high
+		 * word was not valid. Currently, the C compiler can issue
+		 * the loads in any order, and I don't know of a wrapper
+		 * function that will do this properly, so it is broken on
+		 * 32-bit 3-level for the moment.
+		 */
+		pte_t pte = *ptep;
+		struct page *page;
+
+		if ((pte_val(pte) & mask) != result)
+			return 0;
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+		get_page(page);
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(ptep - 1);
+
+	return 1;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_add(nr, &page->_count);
+}
+
+static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long 

Re: [PATCH 3/4] firewire: enforce access order between generation and node ID

2008-01-23 Thread Nick Piggin
On Thursday 24 January 2008 11:54, Stefan Richter wrote:
> fw_device.node_id and fw_device.generation are accessed without mutexes.
> We have to ensure that all readers will get to see node_id updates
> before generation updates.
>
> An earlier incarnation of this patch fixes an inability to recognize
> devices after "giving up on config rom",
> https://bugzilla.redhat.com/show_bug.cgi?id=429950
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>
> Rework of patches
> firewire: fw-core: enforce write order when updating
> fw_device.generation and parts of
> firewire: fw-core: react on bus resets while the config ROM is being
> fetched firewire: fw-sbp2: enforce read order of device generation and node
> ID from November 1 2007.
>
> Update:
>   - write site and read sites folded into one patch
>   - added fix to fw_device_enable_phys_dma() and fill_bus_reset_event()
>   - smp_ barriers are sufficient
>   - comments, changelog

I don't know the firewire subsystem at all, but the barriers seem
right (in that they match your description of the problem), and
comments for them are really good.
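
For reference, the ordering being enforced can be sketched as the following
pairing (illustrative only, not the driver's actual code; the field names
follow the quoted changelog):

	/* updater, on bus reset */
	device->node_id = node_id;
	smp_wmb();	/* node_id must be visible before generation */
	device->generation = generation;

	/* readers */
	generation = device->generation;
	smp_rmb();	/* load generation before node_id */
	node_id = device->node_id;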

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Nick Piggin
On Thursday 24 January 2008 04:05, Linus Torvalds wrote:
> On Wed, 23 Jan 2008, Anton Salikhmetov wrote:
> > +
> > +   if (pte_dirty(*pte) && pte_write(*pte)) {
>
> Not correct.
>
> You still need to check "pte_present()" before you can test any other
> bits. For a non-present pte, none of the other bits are defined, and for
> all we know there might be architectures out there that require them to
> be non-dirty.
>
> As it is, you just possibly randomly corrupted the pte.
>
> Yeah, on all architectures I know of, if the pte is clear, neither of
> those tests will trigger, so it just happens to work, but it's still
> wrong.

Probably it can fail for !present nonlinear mappings on many
architectures.
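
In other words, the presence check has to gate the other tests. A minimal
sketch of the corrected condition (illustrative, not the submitted patch):

	if (pte_present(*pte) && pte_dirty(*pte) && pte_write(*pte)) {
		/* dirty/write bits are only defined for present ptes */
		...
	}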

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] Minimal fix for private_list handling races

2008-01-23 Thread Nick Piggin
On Thursday 24 January 2008 00:30, Jan Kara wrote:
> On Wed 23-01-08 12:00:02, Nick Piggin wrote:
> > On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> > >   Hi,
> > >
> > >   as I got no answer for a week, I'm resending this fix for races in
> > > private_list handling. Andrew, do you like them more than the previous
> > > version?
> >
> > FWIW, I reviewed this, and it looks OK although I think some comments
> > would be in order.
>
>   Thanks.
>
> > What would be really nice is to avoid the use of b_assoc_buffers
> > completely in this function like I've attempted (untested). I don't
> > know if you'd actually call that an improvement...?
>
>   I thought about this solution as well. But main issue I had with this
> solution is that currently, you nicely submit all the metadata buffers at
> once, so that block layer can sort them and write them in nice order. With
> the array you submit buffers by 16 (or any other fixed amount) and in
> mostly random order... So IMHO fsync would become measurably slower.

Oh, I don't know the filesystems very well... which ones would
attach a large number of metadata buffers to the inode?


> > Couple of things I noticed while looking at this code.
> >
> > - What is osync_buffers_list supposed to do? I couldn't actually
> >   work it out. Why do we care about waiting for these buffers on
> >   here that were added while waiting for writeout of other buffers
> >   to finish? Can we just remove it completely? I must be missing
> >   something.
>
>   The problem here is that mark_buffer_dirty_inode() can move the buffer
> from 'tmp' list back to private_list while we are waiting for another
> buffer...

Hmm, no not while we're waiting for another buffer because b_assoc_buffers
will not be empty. However it is possible between removing from the inode
list and insertion onto the temp list I think, because

  if (list_empty(&bh->b_assoc_buffers)) {

check in mark_buffer_dirty_inode is done without private_lock held. Nice.

With that in mind, doesn't your first patch suffer from a race due to
exactly this unlocked list_empty check when you are removing clean buffers
from the queue?

 CPU1 (fsync_buffers_list)               CPU2 (mark_buffer_dirty_inode)

 if (!buffer_dirty(bh) && !buffer_locked(bh))
                                          mark_buffer_dirty()
                                          if (list_empty(&bh->b_assoc_buffers))
                                                  /* add */
 __remove_assoc_queue(bh);

Which results in the buffer being dirty but not on the ->private_list,
doesn't it?

But let's see... there must be a memory ordering problem here in existing
code anyway, because I don't see any barriers. Between b_assoc_buffers and
b_state (via buffer_dirty); fsync_buffers_list vs mark_buffer_dirty_inode,
right?


> > - What are the get_bh(bh) things supposed to do? Protect the lifetime
> >   of a given bh while "lock" is dropped? That's nice, ignoring the
> >   fact that we brelse(bh) *before* taking the lock again... but isn't
> >   every single other buffer that we _have't_ elevated its reference
> >   exposed to exactly the same lifetime problem? IOW, either it is not
> >   required at all, or it is required for _all_ buffers? (my patch
> >   should fix this).
>
>   I think this get_bh() should stop try_to_free_buffers() from removing the
> buffer. brelse() before taking the private_lock is fine, because the loop
> actually checks for while (!list_empty(tmp)) so we really don't care what
> happens with the buffer after we are done with it. So I think that logic is
> actually fine.

Oh, of course. I overlooked the important fact that the tmp list is
also actually subject to modification via other threads exactly the
same as private_list...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86: test case for the RODATA config option

2008-01-22 Thread Nick Piggin
On Wednesday 23 January 2008 09:44, Arjan van de Ven wrote:
> From: Arjan van de Ven <[EMAIL PROTECTED]>
> Subject: x86: test case for the RODATA config option
>
> This patch adds a test module for the DEBUG_RODATA config
> option to make sure change_page_attr() did indeed make
> "const" data read only.
>
> This testcase both tests the DEBUG_RODATA code as well as
> the change_page_attr() code for correct operation.
>
> When the tests/ patch gets merged, this module should move
> to the tests/ directory.
>
> Signed-off-by: Arjan van de Ven <[EMAIL PROTECTED]>
> ---
>  arch/x86/Kconfig.debug        |    8 +
>  arch/x86/kernel/Makefile_32   |    1 +
>  arch/x86/kernel/Makefile_64   |    2 +
>  arch/x86/kernel/test_rodata.c |   65 ++++++++++++++++++++++++++++++
>  arch/x86/mm/init_32.c         |    3 +
>  arch/x86/mm/init_64.c         |    3 +
>  6 files changed, 82 insertions(+)
>
> Index: linux-2.6.24-rc8/arch/x86/Kconfig.debug
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/Kconfig.debug
> +++ linux-2.6.24-rc8/arch/x86/Kconfig.debug
> @@ -57,6 +57,14 @@ config DEBUG_RODATA
> portion of the kernel code won't be covered by a 2MB TLB anymore.
> If in doubt, say "N".
>
> +config DEBUG_RODATA_TEST
> + tristate "Testcase for the DEBUG_RODATA feature"
> + depends on DEBUG_RODATA && m
> + help
> +   This option enables a testcase for the DEBUG_RODATA
> +   feature as well as for the change_page_attr() infrastructure.
> +   If in doubt, say "N"
> +
>  config 4KSTACKS
>   bool "Use 4Kb for kernel stacks instead of 8Kb"
>   depends on DEBUG_KERNEL
> Index: linux-2.6.24-rc8/arch/x86/mm/init_32.c
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/mm/init_32.c
> +++ linux-2.6.24-rc8/arch/x86/mm/init_32.c
> @@ -790,6 +790,9 @@ static int noinline do_test_wp_bit(void)
>
>  #ifdef CONFIG_DEBUG_RODATA
>
> +const int rodata_test_data;
> +EXPORT_SYMBOL_GPL(rodata_test_data);
> +
>  void mark_rodata_ro(void)
>  {
>   unsigned long start = PFN_ALIGN(_text);
> Index: linux-2.6.24-rc8/arch/x86/mm/init_64.c
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/mm/init_64.c
> +++ linux-2.6.24-rc8/arch/x86/mm/init_64.c
> @@ -590,6 +590,9 @@ void free_initmem(void)
>
>  #ifdef CONFIG_DEBUG_RODATA
>
> +const int rodata_test_data = 5;

I guess this should match the 32-bit case, and be zero instead of
5?

Can you disallow building as a module, and put this in the test
code? It could be run from the end of mark_rodata_ro()...
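
Something like this, say (a hypothetical sketch; rodata_test() is a name
invented here for illustration):

	void mark_rodata_ro(void)
	{
		/* ... existing code write-protecting .rodata ... */

		rodata_test();	/* hypothetical hook: verify that a write
				 * to rodata_test_data actually faults */
	}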
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-01-22 Thread Nick Piggin
On Tuesday 22 January 2008 21:37, Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Well I've twice tried to submit a patch to print stacks for running
> > tasks as well, but nobody seems interested. It would at least give a
> > chance to see something.
>
> i definitely remember having done this myself a couple of times (it
> makes tons of sense to get _some_ info out of the system) but some
> problem in -mm kept reverting it. I don't remember the specifics ... it
> was some race.

Hmm, that's not unlikely. But there is nothing in the backtrace code
which prevents a task from being woken up anyway, is there? I guess
it will be more common now, but if we find a race we can try to fix
the root cause.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND] Minimal fix for private_list handling races

2008-01-22 Thread Nick Piggin
On Wednesday 23 January 2008 04:10, Jan Kara wrote:
>   Hi,
>
>   as I got no answer for a week, I'm resending this fix for races in
> private_list handling. Andrew, do you like them more than the previous
> version?

FWIW, I reviewed this, and it looks OK although I think some comments
would be in order.

What would be really nice is to avoid the use of b_assoc_buffers
completely in this function like I've attempted (untested). I don't
know if you'd actually call that an improvement...?

Couple of things I noticed while looking at this code.

- What is osync_buffers_list supposed to do? I couldn't actually
  work it out. Why do we care about waiting for these buffers on
  here that were added while waiting for writeout of other buffers
  to finish? Can we just remove it completely? I must be missing
  something.

- What are the get_bh(bh) things supposed to do? Protect the lifetime
  of a given bh while "lock" is dropped? That's nice, ignoring the
  fact that we brelse(bh) *before* taking the lock again... but isn't
  every single other buffer that we _haven't_ elevated its reference
  exposed to exactly the same lifetime problem? IOW, either it is not
  required at all, or it is required for _all_ buffers? (my patch
  should fix this).

Hmm, now I remember why I rewrote this file :P
Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -792,47 +792,53 @@ EXPORT_SYMBOL(__set_page_dirty_buffers);
  */
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
+	struct buffer_head *batch[16];
+	int i, idx, done;
 	struct buffer_head *bh;
-	struct list_head tmp;
 	int err = 0, err2;
 
-	INIT_LIST_HEAD(&tmp);
-
+again:
 	spin_lock(lock);
+	idx = 0;
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
 		__remove_assoc_queue(bh);
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
-			list_add(&bh->b_assoc_buffers, &tmp);
-			if (buffer_dirty(bh)) {
-get_bh(bh);
-spin_unlock(lock);
-/*
- * Ensure any pending I/O completes so that
- * ll_rw_block() actually writes the current
- * contents - it is a noop if I/O is still in
- * flight on potentially older contents.
- */
-ll_rw_block(SWRITE, 1, &bh);
-brelse(bh);
-spin_lock(lock);
-			}
+			batch[idx++] = bh;
+			get_bh(bh);
 		}
+
+		if (idx == 16)
+			break;
 	}
+	done = list_empty(list);
+	spin_unlock(lock);
 
-	while (!list_empty(&tmp)) {
-		bh = BH_ENTRY(tmp.prev);
-		list_del_init(&bh->b_assoc_buffers);
-		get_bh(bh);
-		spin_unlock(lock);
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
+		if (buffer_dirty(bh)) {
+			/*
+			 * Ensure any pending I/O completes so
+			 * that ll_rw_block() actually writes
+			 * the current contents - it is a noop
+			 * if I/O is still in flight on
+			 * potentially older contents.
+			 */
+			ll_rw_block(SWRITE, 1, &bh);
+		}
+	}
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
 			err = -EIO;
 		brelse(bh);
-		spin_lock(lock);
 	}
+
+	idx = 0;
+	if (!done)
+		goto again;
 	
-	spin_unlock(lock);
 	err2 = osync_buffers_list(lock, list);
 	if (err)
 		return err;
 	else
 		return err2;
 }


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-01-21 Thread Nick Piggin
On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:
> On Tue, 2008-01-22 at 11:05 +1100, Nick Piggin wrote:
> > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > reader starts using 100% of CPU time after some time. When this
> > > happens, kill -9 does not work, and strace just hangs when trying to
> > > attach to the process. The same with gdb. ps shows the process as
> > > being in the R state.
> > >
> > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > Jan 21 21:45:01 Anastacia kernel: pan   R  running task   
> > > 0
> >
> > Well I've twice tried to submit a patch to print stacks for running
> > tasks as well, but nobody seems interested. It would at least give a
> > chance to see something.
>
> I've hit same twice recently (not pan, and not repeatable).

Nasty. The attached patch is something really simple that can sometimes help.
sysrq+p is also an option, if you're on a UP system.

Any luck getting traces?

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4920,8 +4920,7 @@ static void show_task(struct task_struct
 	printk(KERN_CONT "%5lu %5d %6d\n", free,
 		task_pid_nr(p), task_pid_nr(p->real_parent));
 
-	if (state != TASK_RUNNING)
-		show_stack(p, NULL);
+	show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)


[patch] mm: fix PageUptodate data race

2008-01-21 Thread Nick Piggin

After running SetPageUptodate, preceeding stores to the page contents to
actually bring it uptodate may not be ordered with the store to set the page
uptodate.

Therefore, another CPU which checks PageUptodate is true, then reads the
page contents can get stale data.

Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
PageUptodate.

Many places that test PageUptodate, do so with the page locked, and this
would be enough to ensure memory ordering in those places if SetPageUptodate
were only called while the page is locked. Unfortunately that is not always
the case for some filesystems, but it could be an idea for the future.

Also bring the handling of anonymous page uptodateness in line with that of
file backed page management, by marking anon pages as uptodate when they _are_
uptodate, rather than when our implementation requires that they be marked as
such. Doing so allows us to get rid of the smp_wmb's in the page copying
functions, which were especially added for anonymous pages for an analogous
memory ordering problem. Both file and anonymous pages are handled with the
same barriers.
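
The rule the patch encodes can be summarised by this pairing (a sketch, not
the patch itself):

	/* writer: bring the page uptodate */
	/* ... stores filling the page contents ... */
	smp_wmb();			/* contents visible before PG_uptodate */
	SetPageUptodate(page);

	/* reader */
	if (PageUptodate(page)) {	/* implies smp_rmb() when true */
		/* ... safe to load the page contents ... */
	}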

FAQ:
Q. Why not do this in flush_dcache_page?
A. Firstly, flush_dcache_page handles only one side (the smp side) of the
ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
memory barriers in a completely unrelated function is nasty; at least in the
PageUptodate macros, they are located together with (half) the operations
involved in the ordering. Thirdly, the smp_wmb is only required when first
bringing the page uptodate, whereas flush_dcache_page should be called each time
it is written to through the kernel mapping. It is logically the wrong place to
put it.

Q. Why does this increase my text size / reduce my performance / etc.
A. Because it is adding the necessary instructions to eliminate the data-race.

Q. Can it be improved?
A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
run under the page lock, we could avoid the smp_rmb places where PageUptodate
is queried under the page lock. Requires audit of all filesystems and at least
some would need reworking. That's great you're interested, I'm eagerly awaiting
your patches.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/include/linux/highmem.h
===
--- linux-2.6.orig/include/linux/highmem.h
+++ linux-2.6/include/linux/highmem.h
@@ -68,8 +68,6 @@ static inline void clear_user_highpage(s
void *addr = kmap_atomic(page, KM_USER0);
clear_user_page(addr, vaddr, page);
kunmap_atomic(addr, KM_USER0);
-   /* Make sure this page is cleared on other CPU's too before using it */
-   smp_wmb();
 }
 
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
@@ -160,8 +158,6 @@ static inline void copy_user_highpage(st
copy_user_page(vto, vfrom, vaddr, to);
kunmap_atomic(vfrom, KM_USER0);
kunmap_atomic(vto, KM_USER1);
-   /* Make sure this page is cleared on other CPU's too before using it */
-   smp_wmb();
 }
 
 #endif
Index: linux-2.6/include/linux/page-flags.h
===
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -131,16 +131,52 @@
 #define ClearPageReferenced(page)  clear_bit(PG_referenced, &(page)->flags)
 #define TestClearPageReferenced(page) test_and_clear_bit(PG_referenced, 
&(page)->flags)
 
-#define PageUptodate(page) test_bit(PG_uptodate, &(page)->flags)
+static inline int PageUptodate(struct page *page)
+{
+   int ret = test_bit(PG_uptodate, &(page)->flags);
+
+   /*
+* Must ensure that the data we read out of the page is loaded
+* _after_ we've loaded page->flags to check for PageUptodate.
+* We can skip the barrier if the page is not uptodate, because
+* we wouldn't be reading anything from it.
+*
+* See SetPageUptodate() for the other side of the story.
+*/
+   if (ret)
+   smp_rmb();
+
+   return ret;
+}
+
+static inline void __SetPageUptodate(struct page *page)
+{
+   smp_wmb();
+   __set_bit(PG_uptodate, &(page)->flags);
 #ifdef CONFIG_S390
+   page_clear_dirty(page);
+#endif
+}
+
 static inline void SetPageUptodate(struct page *page)
 {
+#ifdef CONFIG_S390
if (!test_and_set_bit(PG_uptodate, &page->flags))
page_clear_dirty(page);
-}
 #else
-#define SetPageUptodate(page)  set_bit(PG_uptodate, &(page)->flags)
+   /*
+* Memory barrier must be issued before setting the PG_uptodate bit,
+* so that all previous stores issued in order to bring the page
+* uptodate are actually visible before PageUptodate becomes true.
+*
+* s390 doesn't need an explicit smp_wmb here because the test and
+* set bit already provides full barriers.
+*/
+   smp_wmb();
+   set_bit(PG_uptodate, &(page)->flags);
+#endif
+}

Re: what's up for v2.6.25 in x86.git

2008-01-21 Thread Nick Piggin
On Tuesday 22 January 2008 12:13, Nick Piggin wrote:
> On Tuesday 22 January 2008 07:14, Ingo Molnar wrote:
> > Nick Piggin (5):
> >   mm: fix PageUptodate memory ordering bug
>
> This should actually be named differently. It should be
> called
>
> x86: don't unconditionally enable expensive SMP ppro workaround
>
> I actually had a more complete patch which printed a warning if
> booting such a system without the config option.

Ah sorry, here is a refreshed version
The selection of many CPU architecture families causes pentium pro memory
ordering errata workarounds to be enabled. This causes memory barriers and
spinlocks to become much more expensive, just to provide a few hacks for a very
rare (nowadays) class of system.

Just print a warning if such a machine is detected. Also suggest a new CONFIG option that can be enabled to support such CPUs.

This saves nearly 1KB of icache in mm/ alone.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/arch/x86/Kconfig.cpu
===
--- linux-2.6.orig/arch/x86/Kconfig.cpu
+++ linux-2.6/arch/x86/Kconfig.cpu
@@ -322,9 +322,21 @@ config X86_XADD
 	default y
 
 config X86_PPRO_FENCE
-	bool
+	bool "PentiumPro memory ordering errata workaround"
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1
-	default y
+	default n
+	help
+ Old PentiumPro multiprocessor systems had errata that could cause
+ memory operations to violate the x86 ordering standard in rare cases.
+ Enabling this option will attempt to work around some (but not all)
+ occurrences of these problems, at the cost of much heavier spinlock
+ and memory barrier operations.
+
+ If you say N here, these systems will be detected and limited to a
+ single CPU at boot time.
+
+ If unsure, say N here. Even distro kernels should think twice before
+ enabling this: there are few systems, and an unlikely bug.
 
 config X86_F00F_BUG
 	bool
Index: linux-2.6/arch/x86/kernel/cpu/intel.c
===
--- linux-2.6.orig/arch/x86/kernel/cpu/intel.c
+++ linux-2.6/arch/x86/kernel/cpu/intel.c
@@ -108,6 +108,32 @@ static void __cpuinit trap_init_f00f_bug
 }
 #endif
 
+/*
+ * Errata #66, #92, #51
+ */
+static void __cpuinit ppro_memory_bug(void)
+{
+#ifndef CONFIG_X86_PPRO_FENCE
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	boot_cpu_data.x86 == 6 &&
+	boot_cpu_data.x86_model == 1) {
+		if (boot_cpu_data.x86_mask < 8)
+			printk(KERN_WARNING "WARNING: Pentium Pro with "
+			 "Errata#66, #92, #51 detected. Running this kernel "
+			 "may cause inconsistent results. "
+			 "Enable CONFIG_X86_PPRO_FENCE.\n");
+
+#ifdef CONFIG_SMP
+		else
+			printk(KERN_WARNING "WARNING: Pentium Pro with "
+			 "Errata#66, #92, #51 detected. Running this kernel "
+			 "in an SMP system may cause inconsistent results. "
+			 "Enable CONFIG_X86_PPRO_FENCE if using SMP system.\n");
+#endif
+	}
+#endif
+}
+
 static void __cpuinit init_intel(struct cpuinfo_x86 *c)
 {
 	unsigned int l2 = 0;
@@ -132,6 +158,8 @@ static void __cpuinit init_intel(struct 
 	}
 #endif
 
+	ppro_memory_bug();
+
 	select_idle_routine(c);
 	l2 = init_intel_cacheinfo(c);
 	if (c->cpuid_level > 9 ) {


Re: what's up for v2.6.25 in x86.git

2008-01-21 Thread Nick Piggin
On Tuesday 22 January 2008 07:14, Ingo Molnar wrote:

> Nick Piggin (5):
>   mm: fix PageUptodate memory ordering bug

This should actually be named differently. It should be
called

x86: don't unconditionally enable expensive SMP ppro workaround

I actually had a more complete patch which printed a warning if
booting such a system without the config option.
The selection of many CPU architecture families causes pentium pro memory
ordering errata workarounds to be enabled. This causes memory barriers and
spinlocks to become much more expensive, just to provide a few hacks for a very
rare (nowadays) class of system.

Just print a warning if such a machine is detected. Also suggest a new CONFIG option that can be enabled to support such CPUs.

This saves nearly 1KB of icache in mm/ alone.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/arch/x86/Kconfig.cpu
===
--- linux-2.6.orig/arch/x86/Kconfig.cpu
+++ linux-2.6/arch/x86/Kconfig.cpu
@@ -322,9 +322,20 @@ config X86_XADD
 	default y
 
 config X86_PPRO_FENCE
-	bool
-	depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1
-	default y
+	bool "PentiumPro memory ordering errata workaround"
+	default n
+	help
+ Old PentiumPro multiprocessor systems had errata that could cause
+ memory operations to violate the x86 ordering standard in rare cases.
+ Enabling this option will attempt to work around some (but not all)
+ occurrences of these problems, at the cost of much heavier spinlock
+ and memory barrier operations.
+
+ If you say N here, these systems will be detected and limited to a
+ single CPU at boot time.
+
+ If unsure, say N here. Even distro kernels should think twice before
+ enabling this: there are few systems, and an unlikely bug.
 
 config X86_F00F_BUG
 	bool
Index: linux-2.6/arch/x86/kernel/cpu/intel.c
===
--- linux-2.6.orig/arch/x86/kernel/cpu/intel.c
+++ linux-2.6/arch/x86/kernel/cpu/intel.c
@@ -108,6 +108,25 @@ static void __cpuinit trap_init_f00f_bug
 }
 #endif
 
+/*
+ * Errata #66, #92, #51
+ */
+static void __cpuinit ppro_memory_bug(void)
+{
+#ifndef CONFIG_X86_PPRO_FENCE
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	boot_cpu_data.x86 == 6 &&
+	boot_cpu_data.x86_model == 1) {
+		if (boot_cpu_data.x86_mask < 8)
+			printk(KERN_WARNING "WARNING: Pentium Pro with Errata#66, #92, #51 detected.");
+		else
+			printk(KERN_WARNING "WARNING: Pentium Pro with Errata#66, #92 detected.");
+
+		printk("running this kernel may cause inconsistent results and errors. Enable CONFIG_X86_PPRO_FENCE in your kernel config.\n");
+	}
+#endif
+}
+
 static void __cpuinit init_intel(struct cpuinfo_x86 *c)
 {
 	unsigned int l2 = 0;
@@ -132,6 +151,8 @@ static void __cpuinit init_intel(struct 
 	}
 #endif
 
+	ppro_memory_bug();
+
 	select_idle_routine(c);
 	l2 = init_intel_cacheinfo(c);
 	if (c->cpuid_level > 9 ) {


Re: [RFC PATCH 12/23 -v4] Use RCU algorithm for monotonic cycles.

2008-01-21 Thread Nick Piggin
On Tuesday 22 January 2008 02:22, Steven Rostedt wrote:
> From: john stultz <[EMAIL PROTECTED]>

>  static inline cycle_t
> -clocksource_get_cycles(struct clocksource *cs, cycle_t now)
> +clocksource_get_basecycles(struct clocksource *cs)
>  {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> - offset += cs->cycle_accumulated;
> + int num;
> + cycle_t now, offset;
> +
> + preempt_disable();
> + num = cs->base_num;
> + smp_read_barrier_depends();

All barriers need comments in the code. eg. with read barriers, the
comment should contain a list of the loads being ordered, and a
reference to the places where stores come from.

I know it isn't too hard to follow _now_, but it makes the code more
maintainable.


> + now = clocksource_read(cs);
> + offset = (now - cs->base[num].cycle_base_last);
> + offset &= cs->mask;
> + offset += cs->base[num].cycle_base;
> + preempt_enable();
> +
>   return offset;
>  }
>
> @@ -197,14 +215,26 @@ clocksource_get_cycles(struct clocksourc
>   * @now: current cycle value
>   *
>   * Used to avoids clocksource hardware overflow by periodically
> - * accumulating the current cycle delta. Must hold xtime write lock!
> + * accumulating the current cycle delta. Uses RCU-like update, but
> + * ***still requires the xtime_lock is held for writing!***
>   */
>  static inline void clocksource_accumulate(struct clocksource *cs, cycle_t
> now) {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> + /* First update the monotonic base portion.
> +  * The dual array update method allows for lock-free reading.
> +  */
> + int num = 1 - cs->base_num;
> + cycle_t offset = (now - cs->base[1-num].cycle_base_last);
> + offset &= cs->mask;
> + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
> + cs->base[num].cycle_base_last = now;
> + wmb();
> + cs->base_num = num;

Ditto for the wmb. Also, I think the wmb() can probably just be
smp_wmb().
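
To illustrate both points, the update side might read something like this
(a sketch only, reusing the names from the quoted patch):

	cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
	cs->base[num].cycle_base_last = now;
	/*
	 * The stores to base[num] above must be visible before the
	 * base_num update below; pairs with the smp_read_barrier_depends()
	 * after the base_num load in clocksource_get_basecycles().
	 */
	smp_wmb();
	cs->base_num = num;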
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24 regression: pan hanging unkilleable and un-straceable

2008-01-21 Thread Nick Piggin
On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> reader starts using 100% of CPU time after some time. When this happens,
> kill -9 does not work, and strace just hangs when trying to attach to
> the process. The same with gdb. ps shows the process as being in the R
> state.
>
> I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> Jan 21 21:45:01 Anastacia kernel: pan   R  running task0 

Well I've twice tried to submit a patch to print stacks for running
tasks as well, but nobody seems interested. It would at least give a
chance to see something.

Can you post a few Sysrq+P traces?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ramdisk driver: make rd_size non-static

2008-01-18 Thread Nick Piggin
On Thu, Jan 17, 2008 at 08:39:23PM -0600, Matt Mackall wrote:
> 
> On Thu, 2008-01-17 at 18:28 -0800, Andrew Morton wrote:
> > On Fri, 18 Jan 2008 02:02:17 + Byron Bradley <[EMAIL PROTECTED]> wrote:
> > 
> > > In arch/arm/kernel/setup.c:setup_ramdisk(), rd_size is set from the
> > > boot tags. The replacement ramdisk driver has rd_size as static
> > > which causes linking to fail when ramdisk is built-in.
> > > 
> > 
> > but...
> > 
> > > diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> > > index 5ef1d26..8536480 100644
> > > --- a/drivers/block/brd.c
> > > +++ b/drivers/block/brd.c
> > > @@ -385,7 +385,7 @@ static struct block_device_operations brd_fops = {
> > >   * And now the modules code and kernel interface.
> > >   */
> > >  static int rd_nr;
> > > -static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
> > > +int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
> > >  module_param(rd_nr, int, 0);
> > >  MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
> > >  module_param(rd_size, int, 0);
> > 
> > rd_size is a module parameter so it is settable via the
> > syntax-which-i-can-never-remember.  rd.rd_size=1024 or something like that.
> > 
> > If that's all sane, do we have some back-compat reason to continue to
> > support the special and duplicative rd_size parameter?
> 
> Only insofar as we're still supporting ramdisks in the first place.

I don't care about initrd or even a backward compatible API myself, I
do have my own reason want this new rd driver in the tree...

Would be nice to get rid of the arch stuff, but it's not too terrible
(at least from the POV of drivers/block/brd.c). So thanks for the patch,
Byron.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: runqueue locks in schedule()

2008-01-18 Thread Nick Piggin
On Friday 18 January 2008 17:33, stephane eranian wrote:
> Nick,

> > It is arch specific. If an architecture wants interrupts on during
> > context switch, or runqueue unlocked, then they set it (btw
> > INTERRUPTS_ON_CTXSW also implies UNLOCKED_CTXSW).
>
> Yes, I noticed that. I am only interested in UNLOCKED_CTXSW.
> But it appears that the approach suggested my Peter does work. We are
> running some tests.

OK, that might be OK.


> > Although, eg on x86, you would hold off interrupts and runqueue lock for
> > slightly less time if you defined those, it results in _slightly_ more
> > complicated context switching... although I did once find a workload
> > where the reduced runqueue contention improved throughput a bit, it is
> > not much problem in general to hold the lock.
>
> By complicated you mean that now you'd have to make sure you don't
> need to access runqueue data?

Well, not speaking about the arch-specific code (which may involve
more complexities), but the core scheduler needs the
task_struct->oncpu variable whereas that isn't required if the
runqueue is locked while switching tasks.
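
For reference, the core scheduler's switch hooks in kernels of that era
looked roughly like this (abridged from kernel/sched.c; exact details vary
by version):

#ifdef __ARCH_WANT_UNLOCKED_CTXSW
static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
{
#ifdef CONFIG_SMP
	/* the incoming task stays "running" until oncpu is cleared */
	next->oncpu = 1;
#endif
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
	spin_unlock_irq(&rq->lock);	/* interrupts-on implies unlocked */
#else
	spin_unlock(&rq->lock);		/* lock dropped before switch_to() */
#endif
}
#endif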
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: runqueue locks in schedule()

2008-01-17 Thread Nick Piggin
On Friday 18 January 2008 00:24, Peter Zijlstra wrote:
> [ At the very least CC'ing the scheduler maintainer would be
> helpful :-) ]
>
> On Wed, 2008-01-16 at 16:29 -0800, stephane eranian wrote:
> > Hello,
> >
> > As suggested by people on this list, I have changed perfmon2 to use
> > the high resolution timers as the interface to allow timeout-based
> > event set multiplexing. This works around the problems I had with
> > tickless-enabled kernels.
> >
> > Multiplexing is supported in per-thread as well. In that case, the
> > timeout measures virtual time. When the thread is context switched
> > out, we need to save the remainder of the timeout and cancel the
> > timer. When the thread is context switched in, we need to reinstall
> > the timer. These timer save/restore operations have to be done in the
> > switch_to() code near the end of schedule().
> >
> > There are situations where hrtimer_start() may end up trying to
> > acquire the runqueue lock. This happens on a context switch where the
> > current thread is blocking (not preempted) and the new timeout happens
> > to be either in the past or just expiring. We've run into such
> > situations with simple tests.
> >
> > On all architectures, but IA-64, it seems thet the runqueue lock is
> > held until the end of schedule(). On IA-64, the lock is released
> > BEFORE switch_to() for some reason I don't quite remember. That may
> > not even be needed anymore.
> >
> > The early unlocking is controlled by a macro named
> > __ARCH_WANT_UNLOCKED_CTXSW. Defining this macros on X86 (or PPC) fixed
> > our problem.
> >
> > It is not clear to me why the runqueue lock needs to be held up until
> > the end of schedule() on some platforms and not on others. Not that
> > releasing the lock earlier does not necessarily introduce more
> > overhead because the lock is never re-acquired later in the schedule()
> > function.
> >
> > Question:
> >- is it safe to release the lock before switch_to() on all
> > architectures?
>
> I had similar problem when using hrtimers from the scheduler, I extended
> the HRTIMER_CB_IRQSAFE_NO_SOFTIRQ time type to run with cpu_base->lock
> unlocked.
>
> http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a
>=commitdiff;h=7e7cbd617833dde5b442e03f69aac39d17d02ec7
> http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a
>=commitdiff;h=45d10aad580a5cdd376e80848aeeaaaf1f97cc18
> http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a
>=commitdiff;h=5ae5d6c5850d4735798bc0e4526d8c61199e9f93
>
> As for your __ARCH_WANT_UNLOCKED_CTXSW question I have to defer to Ingo,
> as I'm unaware of the arch ramifications there.

It is arch specific. If an architecture wants interrupts on during context
switch, or runqueue unlocked, then they set it (btw INTERRUPTS_ON_CTXSW
also implies UNLOCKED_CTXSW).

Although, eg on x86, you would hold off interrupts and runqueue lock for
slightly less time if you defined those, it results in _slightly_ more
complicated context switching... although I did once find a workload
where the reduced runqueue contention improved throughput a bit, it is
not much problem in general to hold the lock.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

