Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
On Mon, Aug 13, 2007 at 06:30:00PM +0800, Fengguang Wu wrote:
> > On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> > > Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug:
> > > Basically they are
> > > - during the dd: ~16M
> > > - after 30s:     ~4M
> > > - after 5s:      ~4M
> > > - after 5s:      ~176M
> > >
> > > The box has 2G memory.
> > >
> > > Question 1:
> > > How come the 5s delays? I ran 4 tests in total, 2 of which have such
> > > 5s delays.
> >
> > pdflush runs every five seconds, so that is indicative of the inode being
> > written once for 1024 pages, and then delayed to the next pdflush run 5s
> > later. Perhaps the inodes aren't moving between the lists exactly the way
> > you think they are...
>
> Now I have figured out the exact situation. When the scan of s_io finishes
> with some small inodes, nr_to_write will be positive, fooling kupdate
> into quitting prematurely. But in fact the big dirty file is on s_more_io
> waiting for more io... The attached patch fixes it.
>
> Fengguang
> ===
>
> Subject: writeback: introduce writeback_control.more_io to indicate more io
>
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after a 30s delay. But
> sometimes the following happens instead:
>
> - after 30s: ~4M
> - after 5s:  ~4M
> - after 5s:  all remaining 92M
>
> Some analysis shows that the internal io dispatch queues go like this:
>
>         s_io        s_more_io
>         ---------------------
> 1)      100M,1K     0
> 2)      1K          96M
> 3)      0           96M
>
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes (BUG)
>
> nr_to_write > 0 in 3) fools the upper layer into thinking that data have
> all been written out. But the big dirty file is still sitting in s_more_io.
> We cannot simply splice s_more_io back to s_io as soon as s_io becomes
> empty, and let the loop in generic_sync_sb_inodes() continue: this may
> starve newly expired inodes in s_dirty.
It is also not an option to draw inodes from both s_more_io and
> s_dirty, and let the loop go on: this might lead to livelocks, and might
> also starve other superblocks in sync time (well, kupdate may still starve
> some superblocks, that's another bug).
>
> So we have to return when a full scan of s_io completes. So nr_to_write > 0
> does not necessarily mean that "all data are written". This patch
> introduces a flag writeback_control.more_io to indicate this situation.
> With it the big dirty file no longer has to wait for the next kupdate
> invocation 5s later.

Sorry, this patch is found to be dangerous. It locks up my desktop on heavy
I/O: kupdate *immediately* returns to push the file in s_more_io for
writeback, but it *could* still be unable to make progress (locks etc.).
Now kupdate ends up *busy looping*. That could be fixed by waiting for, say,
100ms and retrying the io. Should we do it? (Or: is the 5s interval
considered too long a wait?)

> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  fs/fs-writeback.c         |    2 ++
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++---
>  3 files changed, 9 insertions(+), 3 deletions(-)
>
> --- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
> +++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
> @@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
>  		if (wbc->nr_to_write <= 0)
>  			break;
>  	}
> +	if (!list_empty(&sb->s_more_io))
> +		wbc->more_io = 1;
>  	spin_unlock(&inode_lock);
>  	return ret;	/* Leave any unwritten inodes on s_io */
>  }
> ---
linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
> +++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
> @@ -382,6 +382,7 @@ static void background_writeout(unsigned
>  		    global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  		    && min_pages <= 0)
>  			break;
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
> @@ -389,8 +390,9 @@ static void background_writeout(unsigned
>  		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 ||
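The bug described above can be illustrated outside the kernel. Below is a
userspace C sketch of the s_io/s_more_io bookkeeping (toy types and names,
not the real kernel structures): it reproduces state 3), where nr_to_write
stays positive yet dirty data remains parked, and shows how a more_io flag
exposes that to the caller.

```c
#include <assert.h>

#define MAX_WRITEBACK_PAGES 1024

/* Toy model of a superblock's two writeback queues: each "inode" is
 * just a count of dirty pages (hypothetical simplified layout). */
struct sb_model {
	long s_io[8];		/* inodes queued for this pass */
	int  n_io;
	long s_more_io[8];	/* inodes with more io pending */
	int  n_more_io;
};

/* One sync pass, roughly as in generic_sync_sb_inodes().
 * Returns the proposed wbc.more_io flag. */
static int sync_pass(struct sb_model *sb, long *nr_to_write)
{
	int i;

	for (i = 0; i < sb->n_io && *nr_to_write > 0; i++) {
		long chunk = sb->s_io[i] < *nr_to_write ?
			     sb->s_io[i] : *nr_to_write;

		sb->s_io[i] -= chunk;
		*nr_to_write -= chunk;
		if (sb->s_io[i] > 0)	/* not finished: park on s_more_io */
			sb->s_more_io[sb->n_more_io++] = sb->s_io[i];
	}
	/* any unvisited inodes also stay pending */
	for (; i < sb->n_io; i++)
		sb->s_more_io[sb->n_more_io++] = sb->s_io[i];
	sb->n_io = 0;

	return sb->n_more_io > 0;	/* dirty data still waiting? */
}
```

With a 1-page inode on s_io and a big inode already on s_more_io, the pass
leaves nr_to_write positive while more_io is set, which is exactly the
state that used to fool kupdate into quitting.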
Re: [PATCH 002 of 6] Introduce rq_for_each_segment replacing rq_for_each_bio
On Thu, 16 Aug 2007, NeilBrown wrote:
> Every usage of rq_for_each_bio wraps a usage of
> bio_for_each_segment, so these can be combined into
> rq_for_each_segment.
>
> We define "struct req_iterator" to hold the 'bio' and 'index' that
> are needed for the double iteration.

> --- .prev/drivers/block/ps3disk.c	2007-08-16 20:37:26.0 +1000
> +++ ./drivers/block/ps3disk.c	2007-08-16 20:50:07.0 +1000
> @@ -91,30 +91,30 @@ static void ps3disk_scatter_gather(struc
>  			struct request *req, int gather)
>  {
>  	unsigned int offset = 0;
> -	struct bio *bio;
> -	sector_t sector;
> +	struct req_iterator iter;
>  	struct bio_vec *bvec;
> -	unsigned int i = 0, j;
> +	unsigned int i = 0;
>  	size_t size;
>  	void *buf;
>
> -	rq_for_each_bio(bio, req) {
> -		sector = bio->bi_sector;
> +	rq_for_each_segment(bvec, req, iter) {
> +		unsigned long flags;
>  		dev_dbg(&dev->sbd.core,
>  			"%s:%u: bio %u: %u segs %u sectors from %lu\n",
> -			__func__, __LINE__, i, bio_segments(bio),
> -			bio_sectors(bio), sector);
> -		bio_for_each_segment(bvec, bio, j) {
> +			__func__, __LINE__, i, bio_segments(iter.bio),
> +			bio_sectors(iter.bio),
> +			(unsigned long)iter.bio->bi_sector);
			^^^ Superfluous cast: PS3 is 64-bit only, and casts
			are evil.
> +
>  		size = bvec->bv_len;
> -		buf = __bio_kmap_atomic(bio, j, KM_IRQ0);
> +		buf = bvec_kmap_irq(bvec, flags);

As you already mentioned in the preamble of the patch series, this changes
functionality.
>  		if (gather)
>  			memcpy(dev->bounce_buf+offset, buf, size);
>  		else
>  			memcpy(buf, dev->bounce_buf+offset, size);
>  		offset += size;
>  		flush_kernel_dcache_page(bio_iovec_idx(bio,
>  						       j)->bv_page);
> -		__bio_kunmap_atomic(bio, KM_IRQ0);
> -	}
> +		bio_kunmap_bvec(bvec, flags);
> +
>  		i++;
>  	}
>  }

With kind regards,

Geert Uytterhoeven
Software Architect
Sony Network and Software Technology Center Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone: +32 (0)2 700 8453
Fax:   +32 (0)2 700 8622
E-mail: [EMAIL PROTECTED]
Internet: http://www.sony-europe.com/

Sony Network and Software Technology Center Europe
A division of Sony Service Centre (Europe) N.V.
Registered office: Technologielaan 7 · B-1840 Londerzeel · Belgium
VAT BE 0413.825.160 · RPR Brussels
Fortis Bank Zaventem · Swift GEBABEBB08A · IBAN BE39001382358619
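For readers unfamiliar with the macro being introduced: the collapse of
rq_for_each_bio + bio_for_each_segment into one iterator can be modeled in
plain C. The types below are toy stand-ins (not the real struct bio /
struct request), but the nested-for shape of the macro matches the idea of
the patch.

```c
#include <assert.h>

/* Toy stand-ins for the block-layer types (hypothetical, simplified). */
struct bio_vec { int bv_len; };
struct bio {
	struct bio_vec vecs[4];
	int nvecs;
	struct bio *bi_next;	/* bios of a request form a chain */
};
struct request { struct bio *bio; };

/* Holds the 'bio' and segment index needed for the double iteration,
 * in the spirit of the patch's struct req_iterator. */
struct req_iterator {
	struct bio *bio;
	int i;
};

/* Two nested for-loops hidden behind one macro: walk every segment
 * of every bio of the request. */
#define rq_for_each_segment(bvec, rq, iter)				  \
	for ((iter).bio = (rq)->bio; (iter).bio;			  \
	     (iter).bio = (iter).bio->bi_next)				  \
		for ((iter).i = 0;					  \
		     (iter).i < (iter).bio->nvecs &&			  \
			((bvec) = &(iter).bio->vecs[(iter).i], 1);	  \
		     (iter).i++)
```

A caller's loop body then sees one bvec at a time, with iter.bio available
when per-bio information (like bi_sector in the ps3disk hunk) is needed.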
[Announce] RHEL5 LSPP/EAL4 Certification Testsuite has been released
Dear All, The Red Hat Enterprise Linux 5 LSPP/EAL4 Certification Testsuite has been released and can be found at http://ltp.sourceforge.net/ or http://sourceforge.net/project/showfiles.php?group_id=3382. You can find more details about this testsuite from 'lspp/README' contained in the package. Regards & Thanks-- Subrata Modak - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/23] per device dirty throttling -v9
On Thu, 2007-08-16 at 14:29 -0700, Christoph Lameter wrote:
> Is there any way to make the global limits on which the dirty rate
> calculations are based cpuset specific?
>
> A process is part of a cpuset and that cpuset has only a fraction of
> memory of the whole system.
>
> And only a fraction of that fraction can be dirtied. We do not currently
> enforce such limits which can cause the amount of dirty pages in
> cpusets to become excessively high. I have posted several patchsets that
> deal with that issue. See http://lkml.org/lkml/2007/1/16/5
>
> It seems that limiting dirty pages in cpusets may be much easier to
> realize in the context of this patchset. The tracking of the dirty pages
> per node is not necessary if one would calculate the maximum amount of
> dirtyable pages in a cpuset and use that as a base, right?

Currently we do:

  dirty = total_dirty * bdi_completions_p * task_dirty_p

As dgc pointed out before, there is the issue of bdi/task correlation, that
is, we do not track task dirty rates per bdi, so now a task that heavily
dirties on one bdi will also get penalised on the others (and similar
issues).

If we were to change it so:

  dirty = cpuset_dirty * bdi_completions_p * task_dirty_p

We get additional correlation issues: cpuset/bdi, cpuset/task. Which could
yield surprising results if some bdis are strictly per cpuset.

The cpuset/task correlation has a strict mapping and could be solved by
keeping the vm_dirties counter per cpuset. However, this would seriously
complicate the code and I'm not sure if it would gain us much.

Anyway, things to ponder. But overall it should be quite doable.
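For illustration, the two formulas differ only in the base term. A
fixed-point sketch (all numbers hypothetical; the kernel's proportion code
is considerably more involved):

```c
#include <assert.h>

/* Proportions scaled by 1024 (parts-per-1024) to avoid floating point,
 * loosely in the style of kernel fixed-point proportion code. */
#define PROP_SHIFT 10

/* dirty = base * bdi_completions_p * task_dirty_p
 * 'base' is total_dirty in the current scheme, or cpuset_dirty in the
 * cpuset-aware variant discussed above. */
static long task_limit(long base_dirty, long bdi_completions_p,
		       long task_dirty_p)
{
	return (base_dirty * bdi_completions_p >> PROP_SHIFT)
		* task_dirty_p >> PROP_SHIFT;
}
```

Swapping a 200000-page global base for a 50000-page cpuset base, with the
same bdi (50%) and task (25%) proportions, scales the limit down by the
same factor as the cpuset's share of memory.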
Re: Fork Bombing Patch
Petr wrote:
> Please do not add comments inside functions.

I find this advice a bit odd. I am not aware of any prohibition of
comments inside functions.

As with comments outside functions, they should serve a worthwhile
purpose, of course. One might debate whether this particular comment added
by Anand was sufficiently valuable to be worth having.

But I don't agree to a blanket prohibition on comments inside functions.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [PATCH 000 of 6] A few block-layer tidy-up patches.
On Friday August 17, [EMAIL PROTECTED] wrote:
> > > Please inspect the #block-2.6.24 branch to see the result.
> >
> > I don't know where to look for this. I checked
> > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git
> > but they don't seem to be there.
>
> ??
>
> That's where it is, but the kernel.org mirroring is just horribly slow.
> Just checked now and it's there.

I discovered I was looking in the wrong place - not being very familiar
with git terminology. I found them and it looks right.

I had a bit of a look at removing bio_data and ->buffer ... the usages
outside of drivers/ide are fairly easy to deal with - I might send a patch
for that. The drivers/ide stuff looks like a fairly substantial rewrite is
in order. e.g. idefloppy_packet_command_s seems to duplicate a lot of
fields from 'struct request', and it should probably use the request struct
directly. But a number of times ->buffer points to ->cmd, and there is no
bio. I guess we should use bio_map_kern to make a bio?

I'll see if I can come up with something, though testing it might be
awkward. I have an ide cdrom I can test on. Maybe an ide disk, but not an
ide floppy :-)

NeilBrown
Re: nanosleep() accuracy
GolovaSteek wrote:
> Hello!
> I need to use sleep with accurate timing.
> I use 2.6.21 with the rt-preempt patch,
> with rt_preempt, dyn_ticks, and local_apic enabled.
> But
>
>   req.tv_nsec = 300000;
>   req.tv_sec = 0;
>   nanosleep(&req,NULL)
>
> makes a pause of around 310-330 microseconds.

How do you measure this?
If you want to have something done every 300 microseconds, you must not
sleep for 300 microseconds in each iteration, because you'd accumulate
errors. Use a periodic timer or use the current time to compute how long
to sleep in each iteration. Take a look at how cyclictest does it.

> I tried to understand how nanosleep() works, but it doesn't depend on
> jiffies or on smp_apic_timer_interrupt.
>
> When can accuracy be lost?
> And how are processes woken up?
>
> GolovaSteek

Don't forget the process will always have non-zero wakeup latency. It
takes some time to process an interrupt, wake up the process and schedule
it to run on the CPU. 10-30 microseconds is not unreasonable.

Michal
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:

  #define atomic_read_volatile(v)	\
  ({					\
  	forget((v)->counter);		\
  	((v)->counter);			\
  })

where:

  #define forget(a)	__asm__ __volatile__ ("" :"=m" (a) :"m" (a))

*vomit* :)

Not only do I hate the keyword volatile, but the barrier is only a
one-sided affair so it's probable this is going to have slightly different
allowed reorderings than a real volatile access.

Also, why would you want to make these insane accessors for atomic_t
types? Just make sure everybody knows the basics of barriers, and they can
apply that knowledge to atomic_t and all other lockless memory accesses as
well.

I like order(x) better, but it's not the most perfect name either.

-- 
SUSE Labs, Novell Inc.
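For readers following along, the forget() clobber above can be tried in
isolation. A userspace sketch (gcc-specific inline asm) of its intended
use: the empty asm tells the compiler that the memory backing the variable
may have changed, so a cached register copy must not be reused across it.

```c
#include <assert.h>

/* Satyam's forget(): an empty asm with the variable as both memory
 * output and input. gcc must assume the memory was rewritten. */
#define forget(a) __asm__ __volatile__("" : "=m"(a) : "m"(a))

static int flag;

/* Without the forget(), gcc would be entitled to fold the two loads
 * of 'flag' into one; with it, each read hits memory again.
 * (Note Nick's caveat: this orders only accesses to 'flag' itself,
 * a one-sided affair, not a full barrier.) */
static int read_twice(void)
{
	int a = flag;
	forget(flag);
	int b = flag;
	return a + b;
}
```

The functional result is of course the same either way in single-threaded
code; the difference only matters when another context can modify the
variable between the two reads.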
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Thu, 16 Aug 2007, Paul E. McKenney wrote:
> On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:
> > On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote:
> > >
> > > The compiler can also reorder non-volatile accesses. For an example
> > > patch that cares about this, please see:
> > >
> > > http://lkml.org/lkml/2007/8/7/280
> > >
> > > This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and
> > > rcu_read_unlock() to ensure that accesses aren't reordered with
> > > respect to interrupt handlers and NMIs/SMIs running on that same CPU.
> >
> > Good, finally we have some code to discuss (even though it's
> > not actually in the kernel yet).
>
> There was some earlier in this thread as well.

Hmm, I never quite got what all this interrupt/NMI/SMI handling and RCU
business you mentioned earlier was all about, but now that you've pointed
to the actual code and issues with it ...

> > First of all, I think this illustrates that what you want
> > here has nothing to do with atomic ops. The ORDERED_WRT_IRQ
> > macro occurs a lot more times in your patch than atomic
> > reads/sets. So *assuming* that it was necessary at all,
> > then having an ordered variant of the atomic_read/atomic_set
> > ops could do just as well.
>
> Indeed. If I could trust atomic_read()/atomic_set() to cause the compiler
> to maintain ordering, then I could just use them instead of having to
> create an ORDERED_WRT_IRQ(). (Or ACCESS_ONCE(), as it is called in a
> different patch.)

  +#define WHATEVER(x)	(*(volatile typeof(x) *)&(x))

I suppose one could want volatile access semantics for stuff that's a
bit-field too, no?

Also, this gives *zero* of the "re-ordering" guarantees that your code
wants (as you've explained below) -- neither w.r.t. CPU re-ordering (which
probably you don't care about) *nor* w.r.t. compiler re-ordering (which
you definitely _do_ care about).
> > However, I still don't know which atomic_read/atomic_set in
> > your patch would be broken if there were no volatile. Could
> > you please point them out?
>
> Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
> atomic_read() and atomic_set(). Starting with __rcu_read_lock():
>
> o	If "ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++"
>	was ordered by the compiler after
>	"ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1", then
>	suppose an NMI/SMI happened after the rcu_read_lock_nesting but
>	before the rcu_flipctr.
>
>	Then if there was an rcu_read_lock() in the SMI/NMI
>	handler (which is perfectly legal), the nested rcu_read_lock()
>	would believe that it could take the then-clause of the
>	enclosing "if" statement. But because the rcu_flipctr per-CPU
>	variable had not yet been incremented, an RCU updater would
>	be within its rights to assume that there were no RCU reads
>	in progress, thus possibly yanking a data structure out from
>	under the reader in the SMI/NMI function.
>
>	Fatal outcome. Note that only one CPU is involved here
>	because these are all either per-CPU or per-task variables.

Ok, so you don't care about CPU re-ordering. Still, I should let you know
that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you
want is a full compiler optimization barrier().

[ Your code probably works now, and emits correct code, but that's just
  because gcc did what it did. Nothing in any standard, or in any
  documented behaviour of gcc, or anything about the real (or expected)
  semantics of "volatile" is protecting the code here. ]

> o	If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1"
>	was ordered by the compiler to follow the
>	"ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx", and an NMI/SMI
>	happened between the two, then an __rcu_read_lock() in the NMI/SMI
>	would incorrectly take the "else" clause of the enclosing "if"
>	statement.
>	If some other CPU flipped the rcu_ctrlblk.completed
>	in the meantime, then the __rcu_read_lock() would (correctly)
>	write the new value into rcu_flipctr_idx.
>
>	Well and good so far. But the problem arises in
>	__rcu_read_unlock(), which then decrements the wrong counter.
>	Depending on exactly how subsequent events played out, this could
>	result in either prematurely ending grace periods or never-ending
>	grace periods, both of which are fatal outcomes.
>
> And the following are not needed in the current version of the
> patch, but will be in a future version that either avoids disabling
> irqs or that dispenses with the smp_read_barrier_depends() that I
> have 99% convinced myself is unneeded:
>
> o	nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
>
> o	idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
>
> Furthermore, in that future version, irq handlers can cause the same
> mischief that SMI/NMI handlers can in this version.
>
> Next, looking
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Nick Piggin wrote: > I don't know why people would assume volatile of atomics. AFAIK, most > of the documentation is pretty clear that all the atomic stuff can be > reordered etc. except for those that modify and return a value. Which documentation is there? For driver authors, there is LDD3. It doesn't specifically cover effects of optimization on accesses to atomic_t. For architecture port authors, there is Documentation/atomic_ops.txt. Driver authors also can learn something from that document, as it indirectly documents the atomic_t and bitops APIs. Prompted by this thread, I reread this document, and indeed, the sentence "Unlike the above routines, it is required that explicit memory barriers are performed before and after [atomic_{inc,dec}_return]" indicates that atomic_read (one of the "above routines") is very different from all other atomic_t accessors that return values. This is strange. Why is it that atomic_read stands out that way? IMO this API imbalance is quite unexpected by many people. Wouldn't it be beneficial to change the atomic_read API to behave the same like all other atomic_t accessors that return values? OK, it is also different from the other accessors that return data in so far as it doesn't modify the data. But as driver "author", i.e. user of the API, I can't see much use of an atomic_read that can be reordered and, more importantly, can be optimized away by the compiler. Sure, now that I learned of these properties I can start to audit code and insert barriers where I believe they are needed, but this simply means that almost all occurrences of atomic_read will get barriers (unless there already are implicit but more or less obvious barriers like msleep). 
-- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/
Re: [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Rusty Russell wrote:
> On Thu, 2007-08-16 at 17:58 +0200, Laurent Vivier wrote:
>> [PATCH 3/3] introduce "account modifiers" mechanism in the kernel
>> allowing a module to modify the collected accounting for a given task.
>> This implementation is based on the "preempt_notifier".
>> "account_system_time()" and "account_user_time()" can call functions
>> registered by a module to modify the cputime value.
>>
>> Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]>
>
> Hi Laurent,

Hi Rusty, how are your puppies? And thank you for your comment.

> This seems a little like overkill. Why not just add an
> "account_guest_time" which subtracts the given amount of time from
> system time (if available) and adds it to guest time? Then kvm (and
> lguest) should just need to call this at the right times.

We can. I did something like this before. Doing it like that, I think
there is a major issue: system time can be decreasing (as we subtract a
value from it), and thus we can have negative values in a tool like top.
That's why I play with the cputime to add to system time and not directly
with the system time.

BUT I'm very open, my only goal is to be able to compute guest time; "how"
is not very important... What we can do:

- keep PATCHES 1 and 2, because we need to store guest time for cpu and
  tasks. It is very generic.

- remove PATCH 3, and add in task_struct a "ktime vtime" where we
  accumulate guest time (by calling something like guest_enter() and
  guest_exit() from the virtualization engine), and then in
  account_system_time(), when we have cputime > vtime, subtract vtime from
  cputime and add vtime to user time and guest time. But doing it like
  this we freeze in kernel/sched.c the link between system time, user time
  and guest time (i.e. system time = system time - vtime, user time = user
  time + vtime and guest time = guest time + vtime).

- modify PATCH 4 to use the new PATCH 3.

Do you agree? Does anybody disagree?
Laurent
-- 
- [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
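Rusty's account_guest_time() suggestion, and Laurent's concern about
system time going negative, can both be seen in a toy model (hypothetical
simplified types; the real code deals in cputime_t):

```c
#include <assert.h>

/* Simplified per-task CPU accounting (hypothetical fields). */
struct task_times {
	long utime;	/* user time   */
	long stime;	/* system time */
	long gtime;	/* guest time  */
};

/* Move up to 'cputime' from system time to guest time.  Clamping the
 * slice to the available stime ("if available", as Rusty put it) is
 * what keeps stime from going negative -- the scenario Laurent worries
 * would show up as bogus values in top. */
static void account_guest_time(struct task_times *t, long cputime)
{
	long slice = cputime < t->stime ? cputime : t->stime;

	t->stime -= slice;
	t->gtime += slice;
}
```

The sketch side-steps the question of *when* the virtualization engine
calls this; the thread's alternative (accumulating vtime at
guest_enter()/guest_exit() and reconciling in account_system_time()) is a
different placement of the same arithmetic.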
Re: Fork Bombing Patch
Paul Jackson wrote:
> Petr wrote:
>> Please do not add comments inside functions.
>
> I find this advice a bit odd. I am not aware of
> any prohibition of comments inside functions.
>
> As with comments outside functions, they should
> serve a worthwhile purpose, of course. One might
> debate whether this particular comment added by
> Anand was sufficiently valuable to be worth
> having.
>
> But I don't agree to a blanket prohibition on
> comments inside functions.

I'm not saying that comments inside functions should be prohibited, but
comments inside functions often lead to over-commenting. There must be a
good reason for adding such a comment.

See CodingStyle, chapter 8: Commenting:

  Also, try to avoid putting comments inside a function body.

(Before somebody starts arguing with this one sentence, please also read
the rest of the chapter; it is not long and you'll understand the author's
intention better.)

Kind regards,
Petr Tesarik
Re: [Cluster-devel] Re: [gfs2][RFC] readdir caused ls process into D (uninterruptible) state, under testing with Samba 3.0.25
On 8/16/07, Steven Whitehouse <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Thu, 2007-08-16 at 16:20 +0800, 程任全 wrote:
> > It seems that gfs2 cannot work well with Samba.
> >
> > I'm using gfs2 and the new cluster suite (cman with openais):
> >
> > 1. the testing environment is 1 iscsi target and 2 cluster nodes,
> > 2. the two nodes both use an iscsi initiator to connect to the target,
> > 3. they're using the same physical iscsi disk,
> > 4. LVM2 runs on top of the same iscsi disk,
> > 5. on the same lv (logical volume), I created a gfs2 filesystem,
> > 6. the gfs2 filesystem is mounted on the same path on both nodes,
> > 7. Samba shares the gfs2 mount point on both nodes.
> >
> > Now, testing with Windows clients: with two or more clients connected
> > to Samba everything is still normal; but when heavy writes or reads
> > start, the samba server daemon goes into the D state, that is,
> > uninterruptible in the kernel. I wonder, is that a problem of gfs2?
>
> Which version of gfs2 are you using? GFS2 doesn't support leases which I
> know that Samba uses, however only relatively recent kernels have been
> able to report that fact via the VFS.
>
> > Then I ran a simple ls command on the gfs2 mount point:
> >   $ ls /mnt/gfs2
> > and the ls process also went into the D state.
> >
> > I think it's a problem with the readdir implementation in gfs2, and I
> > want to fix it; could someone give me some pointers?
>
> Can you get a stack trace? echo 't' >/proc/sysrq-trigger
> That should show where Samba is getting stuck,
>
> Steve.
the stack trace of the 'D' state `ls`:
===
ls            D F89B83F8  2200 12018      1          (NOTLB)
 f3eeadd4 0082 f6a425c0 f89b83f8 f3eead9c f6a425d4 f6f32d80 f573a93c
 0001 f89b83f3 c40a2030 c3fa9fa0 c40aaa70 c40aab7c 0e89 b2a4b036
 02e4 c40a2030 f3eeae1c c3f85e98 f8e11e09 f8e11e0e
Call Trace:
 [] gdlm_bast+0x0/0x93 [lock_dlm]
 [] gdlm_ast+0x0/0x5 [lock_dlm]
 [] holder_wait+0x0/0x8 [gfs2]
 [] holder_wait+0x5/0x8 [gfs2]
 [] __wait_on_bit+0x2c/0x51
 [] out_of_line_wait_on_bit+0x6f/0x77
 [] holder_wait+0x0/0x8 [gfs2]
 [] wake_bit_function+0x0/0x3c
 [] wake_bit_function+0x0/0x3c
 [] wait_on_holder+0x3c/0x40 [gfs2]
 [] glock_wait_internal+0x81/0x1a3 [gfs2]
 [] gfs2_glock_nq+0x5e/0x79 [gfs2]
 [] gfs2_getattr+0x72/0xb5 [gfs2]
 [] gfs2_getattr+0x6b/0xb5 [gfs2]
 [] do_path_lookup+0x17a/0x1c3
 [] gfs2_getattr+0x0/0xb5 [gfs2]
 [] vfs_getattr+0x3e/0x51
 [] vfs_lstat_fd+0x2b/0x3d
 [] do_path_lookup+0x17a/0x1c3
 [] mntput_no_expire+0x11/0x6e
 [] sys_lstat64+0xf/0x23
 [] sys_symlinkat+0x81/0xb5
 [] sysenter_past_esp+0x5d/0x81
 [] __ipv6_addr_type+0x88/0xb8

The system is still running, so the normal 'R' and 'S' state processes are
ignored. But this call trace shows that it's not readdir's fault after all;
the problem is with gdlm_bast in the lock_dlm module.

-- 
Denis Cheng
Re: nanosleep() accuracy
2007/8/17, Michal Schmidt <[EMAIL PROTECTED]>:
> GolovaSteek wrote:
> > Hello!
> > I need to use sleep with accurate timing.
> > I use 2.6.21 with the rt-preempt patch,
> > with rt_preempt, dyn_ticks, and local_apic enabled.
> > But
> >
> >   req.tv_nsec = 300000;
> >   req.tv_sec = 0;
> >   nanosleep(&req,NULL)
> >
> > makes a pause of around 310-330 microseconds.
>
> How do you measure this?
> If you want to have something done every 300 microseconds, you must not
> sleep for 300 microseconds in each iteration, because you'd accumulate
> errors. Use a periodic timer or use the current time to compute how long
> to sleep in each iteration. Take a look at how cyclictest does it.

No. I just want my program to go to sleep sometimes and wake up at the
correct time.

> > I tried to understand how nanosleep() works, but it doesn't depend on
> > jiffies or on smp_apic_timer_interrupt.
> >
> > When can accuracy be lost?
> > And how are processes woken up?
> >
> > GolovaSteek
>
> Don't forget the process will always have non-zero wakeup latency. It
> takes some time to process an interrupt, wake up the process and schedule
> it to run on the CPU. 10-30 microseconds is not unreasonable.

But can those 2 operations really be done in 10 microseconds? And why is
there that inconsistency? Why sometimes 10 and sometimes 30? At which
points in the implementation does it happen?

GolovaSteek
Re: [PATCH] Only initialize hvc_console if needed, cleanup Kconfig help
On Fri, 17 Aug 2007 11:25:46 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:
>
> diff -r 0730da2377be drivers/char/Kconfig
> --- a/drivers/char/Kconfig	Tue Aug 14 12:46:08 2007 +1000
> +++ b/drivers/char/Kconfig	Fri Aug 17 09:05:12 2007 +1000
> @@ -568,10 +568,10 @@ config HVC_DRIVER
>  config HVC_DRIVER
>  	bool
>  	help
> -	  Users of pSeries machines that want to utilize the hvc console front-end
> -	  module for their backend console driver should select this option.
> -	  It will automatically be selected if one of the back-end console drivers
> -	  is selected.
> +	  Generic "hypervisor virtual console" infrastructure for various
> +	  hypervisors (pSeries, Xen, lguest).
	                          ^^^ You left out iSeries.

Probably better to just not enumerate them as they may increase over time.

Otherwise looks good (though I haven't actually built/booted it).

-- 
Cheers,
Stephen Rothwell [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/
Re: Early printk behaviour
Mike Frysinger wrote:
>> Hmm, sort of, although I didn't think about the case of no real console
>> replacing the early console. The intention of the patch is to have a
>> smooth handover from the boot console to the real console. And, yes, if
>> no real console is ever registered the boot console keeps running ...
>
> i think it also occurs in the case where real console != early console

No. At least not if the boot console has the CON_BOOT flag set, as it
should. The last message you'll see on the boot console is the handover
printk, telling you which real console device prints the following
messages. Whether or not the early and real consoles go to the same
physical device doesn't matter.

>> So you can either leave it running and *not* mark it __init, so it can
>> keep on going without breaking. Or you can explicitly unregister your
>> boot console at some point, maybe using a late_initcall.
>
> wouldn't a common kernel late_initcall() be more appropriate? if the
> early console hasn't switched over (for whatever reason), then kill it

Hmm, yes, should be doable in generic code. Check whether the current
console has CON_BOOT set and if so unregister it.

cheers,
  Gerd
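The late_initcall idea sketched above amounts to a little list walk. A
userspace model (toy types standing in for the console list and CON_BOOT
flag; the real kernel list also takes a lock):

```c
#include <assert.h>
#include <stddef.h>

#define CON_BOOT 1	/* toy stand-in for the console flag */

struct console {
	int flags;
	struct console *next;
};

/* Model of a generic late_initcall: walk the registered consoles and
 * unregister any boot console that never got handed over. */
static void disable_boot_consoles(struct console **list)
{
	struct console **pp = list;

	while (*pp) {
		if ((*pp)->flags & CON_BOOT)
			*pp = (*pp)->next;	/* unlink = unregister */
		else
			pp = &(*pp)->next;
	}
}
```

The pointer-to-pointer walk removes matching entries without special-casing
the list head, which is why a single pass suffices even when the boot
console is first in the list.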
Re: nanosleep() accuracy
On Aug 17 2007 11:44, GolovaSteek wrote: >> How do you measure this? >> If you want to have something done every 300 microseconds, you must not >> sleep for 300 microseconds in each iteration, because you'd accumulate >> errors. Use a periodic timer or use the current time to compute how long >> to sleep in each iteration. Take a look how cyclictest does it. > >no. I just want my programm go to sleep sometimes and wake up in correct time. Would it be acceptable to use an optimistic strategy, like the one below? Let's say that the following tasks happen at each time: A at 0, B at 300, C at 600, D at 900, E at 1200, F at 1500. Assume sleeping takes 500 µs. Then B and C could be run at 500, D at 1000 and E,F at 1500. Jan --
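Jan's schedule can be checked with a quick simulation. This sketch
formalizes "optimistic" as a half-period lookahead -- a hypothetical rule,
not anything from the thread itself: on each wakeup, also run any task due
less than half a period later, since running it slightly early beats being
a full sleep late. With a 500 µs sleep cost it reproduces the times listed
above.

```c
#include <assert.h>

#define PERIOD     300	/* task i is due at i * 300 us */
#define SLEEP_COST 500	/* one sleep takes at least 500 us */

/* run_at[i] records when task i actually runs. */
static void simulate(int ntasks, long run_at[])
{
	long now = 0;
	int i = 0;

	for (;;) {
		/* run everything already due, plus anything due less
		 * than half a period from now (the lookahead rule) */
		while (i < ntasks && (long)i * PERIOD <= now + PERIOD / 2)
			run_at[i++] = now;
		if (i >= ntasks)
			break;
		now += SLEEP_COST;	/* sleep, waking 500 us later */
	}
}
```

Three sleeps serve five late-or-early tasks instead of five sleeps, which
is the saving the optimistic strategy is after; the price is that a task
like C runs up to 100 µs before its nominal deadline.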
Re: Need help with modules loading
On 8/17/07, Larry Finger <[EMAIL PROTECTED]> wrote: > A new driver for the Broadcom BCM43xx devices has been written that uses > mac80211, rather than > softmac. The newest versions of the Broadcom firmware does not support all > the BCM devices. > Accordingly, a separate driver is being prepared that will use an older > version of the firmware and > support these legacy devices. Unfortunately, there is not a clean separation > based on PCI id's; > however, the revision level of the 802.11 wireless core can be used to > determine which driver should > be used. The scheme works on most systems, but not mine and I need some help > to discover why. > The 'MODALIAS=ssb:v4243id0812rev0A' line is correct for my device. In fact > issuing a modprobe > "ssb:v4243id0812rev0A" command results in the loading of the module. For some > reason, this does not > happen automatically. > > Initially, I suspected that my version of udev (103-13) was too old; however, > upgrading to version > 114 did not help. My module-init-tools are V 3.2.2 and my distro is the > x86_64 version of openSUSE 10.2. openSUSE 10.2 used a whitelist of buses which trigger module loading. It's in the udev sysconfig rules and /sbin/hwup. The easiest fix is probably to add a rule for that bus: ACTION=="add", SUBSYSTEM=="ssb", ENV{MODALIAS}=="?*", RUN+="/sbin/modprobe $env{MODALIAS}" openSUSE 10.3 will call modprobe directly; the whitelist and the whole hwup logic have been removed in the meantime. Kay
Re: [PATCH] sysfs: don't warn on removal of a nonexistent binary file
On Fri, 17 Aug 2007 10:19:24 +0900, Tejun Heo <[EMAIL PROTECTED]> wrote: > Alan Stern wrote: > > This patch (as960) removes the error message and stack dump logged by > > sysfs_remove_bin_file() when someone tries to remove a nonexistent > > file. The warning doesn't seem to be needed, since none of the other > > file-, symlink-, or directory-removal routines in sysfs complain in a > > comparable way. > > > > Signed-off-by: Alan Stern <[EMAIL PROTECTED]> > > Acked-by: Tejun Heo <[EMAIL PROTECTED]> Acked-by: Cornelia Huck <[EMAIL PROTECTED]>
Re: [PATCH] lockdep: annotate rcu_read_{,un}lock()
On Thu, 2007-08-16 at 09:01 -0700, Paul E. McKenney wrote: > On Thu, Aug 16, 2007 at 04:25:07PM +0200, Peter Zijlstra wrote: > > > > There seem to be some unbalanced rcu_read_{,un}lock() issues of late, > > how about doing something like this: > > This will break when rcu_read_lock() and rcu_read_unlock() are invoked > from NMI/SMI handlers -- the raw_local_irq_save() in lock_acquire() will > not mask NMIs or SMIs. > > One approach would be to check for being in an NMI/SMI handler, and > to avoid calling lock_acquire() and lock_release() in those cases. It seems: #define nmi_enter() do { lockdep_off(); __irq_enter(); } while (0) #define nmi_exit() do { __irq_exit(); lockdep_on(); } while (0) Should make it all work out just fine. (for NMIs at least, /me fully ignorant of the workings of SMIs) > Another approach would be to use sparse, which has checks for > rcu_read_lock()/rcu_read_unlock() nesting. Yeah, but one more method can never hurt, no? :-)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Stefan Richter wrote: Nick Piggin wrote: I don't know why people would assume volatile of atomics. AFAIK, most of the documentation is pretty clear that all the atomic stuff can be reordered etc. except for those that modify and return a value. Which documentation is there? Documentation/atomic_ops.txt For driver authors, there is LDD3. It doesn't specifically cover effects of optimization on accesses to atomic_t. For architecture port authors, there is Documentation/atomic_ops.txt. Driver authors also can learn something from that document, as it indirectly documents the atomic_t and bitops APIs. "Semantics and Behavior of Atomic and Bitmask Operations" is pretty direct :) Sure, it says that it's for arch maintainers, but there is no reason why users can't make use of it. Prompted by this thread, I reread this document, and indeed, the sentence "Unlike the above routines, it is required that explicit memory barriers are performed before and after [atomic_{inc,dec}_return]" indicates that atomic_read (one of the "above routines") is very different from all other atomic_t accessors that return values. This is strange. Why is it that atomic_read stands out that way? IMO It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set. this API imbalance is quite unexpected by many people. Wouldn't it be beneficial to change the atomic_read API to behave the same like all other atomic_t accessors that return values? It is very consistent and well defined. Operations which both modify the data _and_ return something are defined to have full barriers before and after. What do you want to add to the other atomic accessors? Full memory barriers? Only compiler barriers? It's quite likely that if you think some barriers will fix bugs, then there are other bugs lurking there anyway. Just use spinlocks if you're not absolutely clear about potential races and memory ordering issues -- they're pretty cheap and simple. 
OK, it is also different from the other accessors that return data in so far as it doesn't modify the data. But as driver "author", i.e. user of the API, I can't see much use of an atomic_read that can be reordered and, more importantly, can be optimized away by the compiler. It will return to you an atomic snapshot of the data (loaded from memory at some point since the last compiler barrier). All you have to be aware of is compiler barriers and the Linux SMP memory ordering model, which should be a given if you are writing lockless code. Sure, now that I learned of these properties I can start to audit code and insert barriers where I believe they are needed, but this simply means that almost all occurrences of atomic_read will get barriers (unless there already are implicit but more or less obvious barriers like msleep). You might find that these places that appear to need barriers are buggy for other reasons anyway. Can you point to some in-tree code we can have a look at? -- SUSE Labs, Novell Inc.
Re: [PATCH]: proc: export a process's resource limits via /proc/<pid>/limits
On Thu, 16 Aug 2007 08:35:38 EDT, Neil Horman said: > Hey again- > Andrew requested that I repost this cleanly, after running the patch > through checkpatch. As requested here it is with the changelog. > > Currently, there exists no method for a process to query the resource > limits of another process. They can be inferred via some mechanisms but they > cannot be explicitly determined. Given that this information can be useful to > know during the debugging of an application, I've written this patch which > exports all of a process's limits via /proc/<pid>/limits. > > Tested successfully by myself on x86 on top of 2.6.23-rc2-mm1. I had only one comment the first time around, and Neil addressed it. I've also tested on x86_64 23-rc2-mm1, and it works here too. I saw where this uses units of 'bytes' while the shell 'ulimit' uses 1024-byte units in some places, but (a) this lists the units and (b) it's consistent with setrlimit(). Testing with values >4G shows it's 64-bit clean as well. One question: Are the units milliseconds or seconds here: + [RLIMIT_CPU] = {"Max cpu time", "ms"}, Other than that, feel free to stick either/both of these on: Reviewed-By: Valdis Kletnieks <[EMAIL PROTECTED]> Tested-By: Valdis Kletnieks <[EMAIL PROTECTED]>
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Paul Mackerras wrote: > Herbert Xu writes: > > > On Fri, Aug 17, 2007 at 03:09:57PM +1000, Paul Mackerras wrote: > > > Herbert Xu writes: > > > > > > > Can you find an actual atomic_read code snippet there that is > > > > broken without the volatile modifier? > > > > > > There are some in arch-specific code, for example line 1073 of > > > arch/mips/kernel/smtc.c. On mips, cpu_relax() is just barrier(), so > > > the empty loop body is ok provided that atomic_read actually does the > > > load each time around the loop. > > > > A barrier() is all you need to force the compiler to reread > > the value. > > > > The people advocating volatile in this thread are talking > > about code that doesn't use barrier()/cpu_relax(). > > Did you look at it? Here it is: > > /* Someone else is initializing in parallel - let 'em finish */ > while (atomic_read(&idle_hook_initialized) < 1000) > ; Honestly, this thread is suffering from HUGE communication gaps. What Herbert (obviously) meant there was that "this loop could've been okay _without_ using volatile-semantics-atomic_read() also, if only it used cpu_relax()". That does work, because cpu_relax() is _at least_ barrier() on all archs (on some it also emits some arch-dependent "pause" kind of instruction). Now, saying that "MIPS does not have such an instruction so I won't use cpu_relax() for arch-dependent-busy-while-loops in arch/mips/" sounds like a wrong argument, because: tomorrow, such arch's _may_ introduce such an instruction, so naturally, at that time we'd change cpu_relax() appropriately (in reality, we would actually *re-define* cpu_relax() and ensure that the correct version gets pulled in depending on whether the callsite code is legacy or only for the newer such CPUs of said arch, whatever), but loops such as this would remain un-changed, because they never used cpu_relax()! 
OTOH an argument that said the following would've made a stronger case: "I don't want to use cpu_relax() because that's a full memory clobber barrier() and I have loop-invariants / other variables around in that code that I *don't* want the compiler to forget just because it used cpu_relax(), and hence I will not use cpu_relax() but instead make my atomic_read() itself have "volatility" semantics. Not just that, but I will introduce a cpu_relax_no_barrier() on MIPS, that would be a no-op #define for now, but which may not be so forever, and continue to use that in such busy loops." In general, please read the thread-summary I've tried to do at: http://lkml.org/lkml/2007/8/17/25 Feel free to continue / comment / correct stuff from there, there's too much confusion and circular-arguments happening on this thread otherwise. [ I might've made an incorrect statement there about "volatile" w.r.t. cache on non-x86 archs, I think. ] Satyam
Re: [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
On Fri, 2007-08-17 at 09:35 +0200, Laurent Vivier wrote: > Rusty Russell wrote: > > Hi Laurent, > > Hi Rusty, > how are your puppies? They're getting a little fat, actually. Too many features ... > - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate > guest time (by calling something like guest_enter() and guest_exit() from the > virtualization engine), and when in account_system_time() we have cputime > > vtime we subtract vtime from cputime and add vtime to user time and guest > time. > But doing like this we freeze in kernel/sched.c the link between system time, > user time and guest time (i.e. system time = system time - vtime, user time = > user time + vtime and guest time = guest time + vtime). Actually, I think we can set a per-cpu "in_guest" flag for the scheduler code, which then knows to add the tick to the guest time. That seems the simplest possible solution. lguest or kvm would set the flag before running the guest (which is done with preempt disabled or using preemption hooks), and reset it afterwards. Thoughts? Rusty.
Re: how can I get an account on master.kernel.org machine
Randy, Eric, Thank you very much! There are only diffs in http://ftp.arm.linux.org.uk/pub/armlinux/kernel/git-cur/. What are these diffs based on, Linus' kernel tree? 2007/8/16, eric miao <[EMAIL PROTECTED]>: > And if your purpose is only to download that tree, my suggestion is that > some of the git trees are not public like linux-arm so the owner can do > whatever he likes in the tree, and don't ever try to have access to it. > > for linux-arm, it will be constantly pulled by Linus, unless you really > want the _most_ up-to-date patches, which you might find at > > http://ftp.arm.linux.org.uk/pub/armlinux/kernel/git-cur/ > > that folder will be constantly refreshed by the maintainer. > > - eric > > Randy Dunlap wrote: > > On Thu, 16 Aug 2007 10:58:41 +0800 ye janboe wrote: > > > >> I want to download the linux-arm git repos, but I do not have an account on > >> the master machine. > > > > see http://www.kernel.org/faq/#account > > > > --- > > ~Randy > > *** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > #define atomic_read_volatile(v) \ > > ({ \ > > forget((v)->counter); \ > > ((v)->counter); \ > > }) > > > > where: > > *vomit* :) I wonder if this'll generate smaller and better code than _both_ the other atomic_read_volatile() variants. Would need to build allyesconfig on lots of diff arch's etc to test the theory though. > Not only do I hate the keyword volatile, but the barrier is only a > one-sided affair so its probable this is going to have slightly > different allowed reorderings than a real volatile access. True ... > Also, why would you want to make these insane accessors for atomic_t > types? Just make sure everybody knows the basics of barriers, and they > can apply that knowledge to atomic_t and all other lockless memory > accesses as well. Code that looks like: while (!atomic_read(&v)) { ... cpu_relax_no_barrier(); forget(v.counter); } would be uglier. Also think about code such as: a = atomic_read(); if (!a) do_something(); forget(); a = atomic_read(); ... /* some code that depends on value of a, obviously */ forget(); a = atomic_read(); ... So much explicit sprinkling of "forget()" looks ugly. atomic_read_volatile() on the other hand, looks neater. The "_volatile()" suffix makes it also no less explicit than an explicit barrier-like macro that this primitive is something "special", for code clarity purposes. > > #define forget(a) __asm__ __volatile__ ("" :"=m" (a) :"m" (a)) > > I like order(x) better, but it's not the most perfect name either. forget(x) is just a stupid-placeholder-for-a-better-name. order(x) sounds good but we could leave quibbling about function or macro names for later, this thread is noisy as it is :-)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Herbert Xu wrote: On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote: BTW, the sort of missing barriers that triggered this thread aren't that subtle. It'll result in a simple lock-up if the loop condition holds upon entry. At which point it's fairly straightforward to find the culprit. Not necessarily. A barrier-less buggy code such as below: atomic_set(&v, 0); ... /* some initial code */ while (atomic_read(&v)) ; ... /* code that MUST NOT be executed unless v becomes non-zero */ (where v->counter has no volatile access semantics) could be generated by the compiler to simply *elide* or *do away with* the loop itself, thereby making the: "/* code that MUST NOT be executed unless v becomes non-zero */" be executed even when v is zero! That is subtle indeed, and causes no hard lockups. Then I presume you mean while (!atomic_read(&v)) ; Which is just the same old infinite loop bug solved with cpu_relax(). These are pretty trivial to audit and fix, and also to debug, I would think. Granted, the above IS buggy code. But, the stated objective is to avoid heisenbugs. Anyway, why are you making up code snippets that are buggy in other ways in order to support this assertion being made that lots of kernel code supposedly depends on volatile semantics. Just reference the actual code. And we have driver / subsystem maintainers such as Stefan coming up and admitting that often a lot of code that's written to use atomic_read() does assume the read will not be elided by the compiler. So these are broken on i386 and x86-64? Are they definitely safe on SMP and weakly ordered machines with just a simple compiler barrier there? Because I would not be surprised if there are a lot of developers who don't really know what to assume when it comes to memory ordering issues.
This is not a dig at driver writers: we still have memory ordering problems in the VM too (and probably most of the subtle bugs in lockless VM code are memory ordering ones). Let's not make up a false sense of security and hope that sprinkling volatile around will allow people to write bug-free lockless code. If a writer can't be bothered reading API documentation and learning the Linux memory model, they can still be productive writing safely locked code. -- SUSE Labs, Novell Inc.
[-mm PATCH 1/9] Memory controller resource counters (v6)
From: Pavel Emelianov <[EMAIL PROTECTED]> Introduce generic structures and routines for resource accounting. Each resource accounting container is supposed to aggregate it, container_subsystem_state and its resource-specific members within. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/res_counter.h | 102 + init/Kconfig|7 ++ kernel/Makefile |1 kernel/res_counter.c| 120 4 files changed, 230 insertions(+) diff -puN /dev/null include/linux/res_counter.h --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/res_counter.h 2007-08-17 13:14:18.0 +0530 @@ -0,0 +1,102 @@ +#ifndef __RES_COUNTER_H__ +#define __RES_COUNTER_H__ + +/* + * Resource Counters + * Contain common data types and routines for resource accounting + * + * Copyright 2007 OpenVZ SWsoft Inc + * + * Author: Pavel Emelianov <[EMAIL PROTECTED]> + * + */ + +#include + +/* + * The core object. the container that wishes to account for some + * resource may include this counter into its structures and use + * the helpers described beyond + */ + +struct res_counter { + /* +* the current resource consumption level +*/ + unsigned long usage; + /* +* the limit that usage cannot exceed +*/ + unsigned long limit; + /* +* the number of unsuccessful attempts to consume the resource +*/ + unsigned long failcnt; + /* +* the lock to protect all of the above. +* the routines below consider this to be IRQ-safe +*/ + spinlock_t lock; +}; + +/* + * Helpers to interact with userspace + * res_counter_read/_write - put/get the specified fields from the + * res_counter struct to/from the user + * + * @counter: the counter in question + * @member: the field to work with (see RES_xxx below) + * @buf: the buffer to opeate on,... + * @nbytes: its size... + * @pos: and the offset. 
+ */ + +ssize_t res_counter_read(struct res_counter *counter, int member, + const char __user *buf, size_t nbytes, loff_t *pos); +ssize_t res_counter_write(struct res_counter *counter, int member, + const char __user *buf, size_t nbytes, loff_t *pos); + +/* + * the field descriptors. one for each member of res_counter + */ + +enum { + RES_USAGE, + RES_LIMIT, + RES_FAILCNT, +}; + +/* + * helpers for accounting + */ + +void res_counter_init(struct res_counter *counter); + +/* + * charge - try to consume more resource. + * + * @counter: the counter + * @val: the amount of the resource. each controller defines its own + * units, e.g. numbers, bytes, Kbytes, etc + * + * returns 0 on success and <0 if the counter->usage will exceed the + * counter->limit _locked call expects the counter->lock to be taken + */ + +int res_counter_charge_locked(struct res_counter *counter, unsigned long val); +int res_counter_charge(struct res_counter *counter, unsigned long val); + +/* + * uncharge - tell that some portion of the resource is released + * + * @counter: the counter + * @val: the amount of the resource + * + * these calls check for usage underflow and show a warning on the console + * _locked call expects the counter->lock to be taken + */ + +void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); +void res_counter_uncharge(struct res_counter *counter, unsigned long val); + +#endif diff -puN init/Kconfig~res_counters_infra init/Kconfig --- linux-2.6.23-rc2-mm2/init/Kconfig~res_counters_infra2007-08-17 13:14:18.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/init/Kconfig2007-08-17 13:14:18.0 +0530 @@ -337,6 +337,13 @@ config CPUSETS Say N if unsure. 
+config RESOURCE_COUNTERS + bool "Resource counters" + help + This option enables controller independent resource accounting + infrastructure that works with containers + depends on CONTAINERS + config SYSFS_DEPRECATED bool "Create deprecated sysfs files" default y diff -puN kernel/Makefile~res_counters_infra kernel/Makefile --- linux-2.6.23-rc2-mm2/kernel/Makefile~res_counters_infra 2007-08-17 13:14:18.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/kernel/Makefile 2007-08-17 13:14:18.0 +0530 @@ -58,6 +58,7 @@ obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_SYSCTL) += utsname_sysctl.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o +obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <[EMAIL PROTECTED]>, the -fno-omit-frame-pointer is diff -puN /dev/null kernel/res_counter.c --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/kernel/res_counter.c2007-08-
[-mm PATCH 5/9] Memory controller task migration (v6)
Allow tasks to migrate from one container to the other. We migrate mm_struct's mem_container only when the thread group id migrates. Signed-off-by: <[EMAIL PROTECTED]> --- mm/memcontrol.c | 35 +++ 1 file changed, 35 insertions(+) diff -puN mm/memcontrol.c~mem-control-task-migration mm/memcontrol.c --- linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-task-migration 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:19.0 +0530 @@ -326,11 +326,46 @@ static int mem_container_populate(struct ARRAY_SIZE(mem_container_files)); } +static void mem_container_move_task(struct container_subsys *ss, + struct container *cont, + struct container *old_cont, + struct task_struct *p) +{ + struct mm_struct *mm; + struct mem_container *mem, *old_mem; + + mm = get_task_mm(p); + if (mm == NULL) + return; + + mem = mem_container_from_cont(cont); + old_mem = mem_container_from_cont(old_cont); + + if (mem == old_mem) + goto out; + + /* +* Only thread group leaders are allowed to migrate, the mm_struct is +* in effect owned by the leader +*/ + if (p->tgid != p->pid) + goto out; + + css_get(&mem->css); + rcu_assign_pointer(mm->mem_container, mem); + css_put(&old_mem->css); + +out: + mmput(mm); + return; +} + struct container_subsys mem_container_subsys = { .name = "memory", .subsys_id = mem_container_subsys_id, .create = mem_container_create, .destroy = mem_container_destroy, .populate = mem_container_populate, + .attach = mem_container_move_task, .early_init = 1, }; _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
[-mm PATCH 4/9] Memory controller memory accounting (v6)
Changelog for v6
1. Do a css_put() in the case of a race in allocating page containers (YAMAMOTO Takashi)

Changelog for v5
1. Rename meta_page to page_container
2. Remove PG_metapage and use the lower bit of the page_container pointer for locking

Changelog for v3
1. Fix a probable leak with meta_page's (pointed out by Paul Menage)
2. Introduce a wrapper around mem_container_uncharge for uncharging pages mem_container_uncharge_page()

Changelog
1. Improved error handling, uncharge on errors and check to see if we are leaking pages (review by YAMAMOTO Takashi)

Add the accounting hooks. The accounting is carried out for RSS and Page Cache (unmapped) pages. There is now a common limit and accounting for both. The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap() time. Page cache is accounted at add_to_page_cache(), __delete_from_page_cache(). Swap cache is also accounted for. Each page's page_container is protected with the last bit of the page_container pointer, this makes handling of race conditions involving simultaneous mappings of a page easier. A reference count is kept in the page_container to deal with cases where a page might be unmapped from the RSS of all tasks, but still lives in the page cache. Credits go to Vaidyanathan Srinivasan for helping with reference counting work of the page container. Almost all of the page cache accounting code has help from Vaidyanathan Srinivasan.
Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h | 20 + mm/filemap.c | 12 ++- mm/memcontrol.c| 166 - mm/memory.c| 43 ++- mm/migrate.c |6 + mm/page_alloc.c|3 mm/rmap.c | 17 mm/swap_state.c| 12 ++- mm/swapfile.c | 41 ++- 9 files changed, 293 insertions(+), 27 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-accounting include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-accounting 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:19.0 +0530 @@ -30,6 +30,13 @@ extern void mm_free_container(struct mm_ extern void page_assign_page_container(struct page *page, struct page_container *pc); extern struct page_container *page_get_page_container(struct page *page); +extern int mem_container_charge(struct page *page, struct mm_struct *mm); +extern void mem_container_uncharge(struct page_container *pc); + +static inline void mem_container_uncharge_page(struct page *page) +{ + mem_container_uncharge(page_get_page_container(page)); +} #else /* CONFIG_CONTAINER_MEM_CONT */ static inline void mm_init_container(struct mm_struct *mm, @@ -51,6 +58,19 @@ static inline struct page_container *pag return NULL; } +static inline int mem_container_charge(struct page *page, struct mm_struct *mm) +{ + return 0; +} + +static inline void mem_container_uncharge(struct page_container *pc) +{ +} + +static inline void mem_container_uncharge_page(struct page *page) +{ +} + #endif /* CONFIG_CONTAINER_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff -puN include/linux/page-flags.h~mem-control-accounting include/linux/page-flags.h diff -puN mm/filemap.c~mem-control-accounting mm/filemap.c --- linux-2.6.23-rc2-mm2/mm/filemap.c~mem-control-accounting2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/filemap.c2007-08-17 13:14:19.0 +0530 @@ -31,6 +31,7 @@ #include #include #include /* for BUG_ON(!in_atomic()) only */ 
+#include #include "internal.h" /* @@ -116,6 +117,7 @@ void __remove_from_page_cache(struct pag { struct address_space *mapping = page->mapping; + mem_container_uncharge_page(page); radix_tree_delete(&mapping->page_tree, page->index); page->mapping = NULL; mapping->nrpages--; @@ -442,6 +444,11 @@ int add_to_page_cache(struct page *page, int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); if (error == 0) { + + error = mem_container_charge(page, current->mm); + if (error) + goto out; + write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); if (!error) { @@ -451,10 +458,13 @@ int add_to_page_cache(struct page *page, page->index = offset; mapping->nrpages++; __inc_zone_page_state(page, NR_FILE_PAGES); - } + } else + mem_container_uncharge_page(page); + write_unlock_irq(&mapping->tree_lock); radix_tree_preload_end();
[-mm PATCH 2/9] Memory controller containers setup (v6)
Changelong 1. use depends instead of select in init/Kconfig 2. Port to v11 3. Clean up the usage of names (container files) for v11 Setup the memory container and add basic hooks and controls to integrate and work with the container. Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/container_subsys.h |6 + include/linux/memcontrol.h | 21 ++ init/Kconfig |7 ++ mm/Makefile |1 mm/memcontrol.c | 127 +++ 5 files changed, 162 insertions(+) diff -puN include/linux/container_subsys.h~mem-control-setup include/linux/container_subsys.h --- linux-2.6.23-rc2-mm2/include/linux/container_subsys.h~mem-control-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/container_subsys.h 2007-08-17 13:14:19.0 +0530 @@ -30,3 +30,9 @@ SUBSYS(ns) #endif /* */ + +#ifdef CONFIG_CONTAINER_MEM_CONT +SUBSYS(mem_container) +#endif + +/* */ diff -puN /dev/null include/linux/memcontrol.h --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:19.0 +0530 @@ -0,0 +1,21 @@ +/* memcontrol.h - Memory Controller + * + * Copyright IBM Corporation, 2007 + * Author Balbir Singh <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#ifndef _LINUX_MEMCONTROL_H +#define _LINUX_MEMCONTROL_H + +#endif /* _LINUX_MEMCONTROL_H */ + diff -puN init/Kconfig~mem-control-setup init/Kconfig --- linux-2.6.23-rc2-mm2/init/Kconfig~mem-control-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/init/Kconfig2007-08-17 13:14:19.0 +0530 @@ -364,6 +364,13 @@ config SYSFS_DEPRECATED If you are using a distro that was released in 2006 or later, it should be safe to say N here. +config CONTAINER_MEM_CONT + bool "Memory controller for containers" + depends on CONTAINERS && RESOURCE_COUNTERS + help + Provides a memory controller that manages both page cache and + RSS memory. + config PROC_PID_CPUSET bool "Include legacy /proc//cpuset file" depends on CPUSETS diff -puN mm/Makefile~mem-control-setup mm/Makefile --- linux-2.6.23-rc2-mm2/mm/Makefile~mem-control-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/Makefile 2007-08-17 13:14:19.0 +0530 @@ -31,4 +31,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_CONTAINER_MEM_CONT) += memcontrol.o diff -puN /dev/null mm/memcontrol.c --- /dev/null 2007-06-01 20:42:04.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:19.0 +0530 @@ -0,0 +1,127 @@ +/* memcontrol.c - Memory Controller + * + * Copyright IBM Corporation, 2007 + * Author Balbir Singh <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include +#include +#include + +struct container_subsys mem_container_subsys; + +/* + * The memory controller data structure. The memory controller controls both + * page cache and RSS per container. We would eventually like to provide + * statistics based on the statistics developed by Rik Van Riel for clock-pro, + * to help the administrator determine what knobs to tune. + * + * TODO: Add a water mark for the memory controller. Reclaim will begin when + * we hit the water mark. + */ +struct mem_container { + struct container_subsys_state css; + /* +* the counter to account for memory usage +*/ + struct res_counter res; +}; + +/* + * A page_container page is associated with every page descriptor. The + * page_container helps us identify information about the container + */ +struct page_container { + struct list_head lru; /* per container LRU list */ + struct page *page; + struct mem_container *mem_container; +}; + + +static inline +st
[-mm PATCH 0/9] Memory controller introduction (v6)
Here's version 6 of the memory controller (against 2.6.23-rc2-mm2). The tests that were run have been included in the _Test Results section_ below.

Changelog since version 5
1. Ported to 2.6.23-rc2-mm2
2. Added a css_put() in the case of a race between allocation of page_containers (YAMAMOTO Takashi)

Changelog since version 4
1. Renamed meta_page to page_container (Nick Piggin)
2. Moved locking from page flags to the last bit of the page_container pointer (Nick Piggin)
3. Fixed a rare race in mem_container_isolate_pages (YAMAMOTO Takashi)

Changelog since version 3
1. Ported to v11 of the containers patchset (2.6.23-rc1-mm1). Paul Menage helped immensely with a detailed review of v3
2. Reclaim is retried to allow reclaim of pages coming in as a result of mapped-pages reclaim (swap cache growing as a result of RSS reclaim)
3. page_referenced() is now container aware. During container reclaim, references from other containers do not prevent a page from being reclaimed from a non-referencing container
4. Fixed a possible race condition spotted by YAMAMOTO Takashi

Changelog since version 2
1. Improved error handling in mm/memory.c (spotted by YAMAMOTO Takashi)
2. Test results included
3. try_to_free_mem_container_pages() bug fix (sc->may_writepage is now set to !laptop_mode)

Changelog since version 1
1. Fixed some compile-time errors (in mm/migrate.c, from Vaidyanathan S)
2. Fixed a panic seen when LIST_DEBUG is enabled
3. Added a mechanism to control whether we track page cache or both page cache and mapped pages (as requested by Pavel)
4. Dave Hansen provided detailed review comments on the code

This patchset implements another version of the memory controller. These patches have been through a big churn; the first set of patches was posted last year and earlier this year at http://lkml.org/lkml/2007/2/19/10

This patchset draws from the patches listed above and from some of the contents of the patches posted by Vaidyanathan for page cache control.
http://lkml.org/lkml/2007/6/20/92

At OLS, at the resource management BOF, it was discussed that we need to manage RSS and unmapped page cache together. This patchset is a step towards that.

TODO's
1. Add memory controller water mark support. Reclaim on high water mark
2. Add support for shrinking on limit change
3. Add per-zone per-container LRU lists (this is being actively worked on by Pavel Emelianov)
4. Figure out a better CLUI for the controller
5. Add better statistics
6. Explore using read_uint64() as recommended by Paul Menage (NOTE: read_ulong() would also be nice to have)

In case you have been using/testing the RSS controller, you'll find that this controller works more slowly than the RSS controller. The reason is that both swap cache and page cache are accounted for, so pages do go out to swap upon reclaim (they cannot live in the swap cache).

Any test output, feedback, comments and suggestions are welcome! I am committed to fixing any bugs and improving the performance of the memory controller. Do not hesitate to send any fixes, or requests for fixes, that are required.

Using the patches
1. Enable the Memory controller configuration option
2. Compile and boot the new kernel
3. mount -t container container -o memory /container will mount the memory controller at the /container mount point
4. mkdir /container/a
5. echo $$ > /container/a/tasks (add tasks to the new container)
6. echo -n <limit in pages> > /container/a/memory.limit; for example, echo -n 204800 > /container/a/memory.limit sets the limit to 800 MB on a system with 4K page size
7. Run tasks and see the memory controller at work
8. Report results, provide feedback
9. Develop/use new patches and go to step 1

Test Results

Results for version 3 of the patch were posted at http://lwn.net/Articles/242554/

The code was also tested on a power box with regular machine-usage scenarios, with the config disabled, and with a stress suite that touched all the memory in the system and was limited in a container.
Dhaval ran several tests on v6 and gave his thumbs up to the controller (a hard-to-achieve goal :-) ).

Ran kernbench and stress, three simultaneously, with one inside a container of 800 MB.

Kernbench results running within the container of 800 MB
Thu Aug 16 22:34:59 IST 2007
2.6.23-rc2-mm2-mem-v6

Average Half load -j 4 Run (std deviation):
Elapsed Time 466.548 (47.6014)
User Time 876.598 (10.5273)
System Time 223.136 (1.29247)
Percent CPU 237.2 (23.2744)
Context Switches 146351 (6539.91)
Sleeps 174003 (5031.94)

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 423.496 (60.625)
User Time 897.285 (23.0391)
System Time 228.836 (6.11205)
Percent CPU 257.1 (40.9022)
Context Switches 262134 (123397)
Sleeps 270815 (103597)

Kernbench results running within the default container
--
Thu Aug 16 22:34:33 IST 2007
2.6.23-rc2-mm2-mem-v6
Average Half load -
[-mm PATCH 7/9] Memory controller OOM handling (v6)
From: Pavel Emelianov <[EMAIL PROTECTED]> Out of memory handling for containers over their limit. A task from the container over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h |1 + mm/memcontrol.c|1 + mm/oom_kill.c | 42 ++ 3 files changed, 40 insertions(+), 4 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-out-of-memory include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-out-of-memory 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:20.0 +0530 @@ -39,6 +39,7 @@ extern unsigned long mem_container_isola int mode, struct zone *z, struct mem_container *mem_cont, int active); +extern void mem_container_out_of_memory(struct mem_container *mem); static inline void mem_container_uncharge_page(struct page *page) { diff -puN mm/memcontrol.c~mem-control-out-of-memory mm/memcontrol.c --- linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-out-of-memory 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:20.0 +0530 @@ -322,6 +322,7 @@ int mem_container_charge(struct page *pa } css_put(&mem->css); + mem_container_out_of_memory(mem); goto free_pc; } diff -puN mm/oom_kill.c~mem-control-out-of-memory mm/oom_kill.c --- linux-2.6.23-rc2-mm2/mm/oom_kill.c~mem-control-out-of-memory 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/oom_kill.c 2007-08-17 13:14:20.0 +0530 @@ -25,6 +25,7 @@ #include #include #include +#include int sysctl_panic_on_oom; /* #define DEBUG */ @@ -48,7 +49,8 @@ int sysctl_panic_on_oom; *of least surprise ... 
(be careful when you change it) */ -unsigned long badness(struct task_struct *p, unsigned long uptime) +unsigned long badness(struct task_struct *p, unsigned long uptime, + struct mem_container *mem) { unsigned long points, cpu_time, run_time, s; struct mm_struct *mm; @@ -61,6 +63,13 @@ unsigned long badness(struct task_struct return 0; } +#ifdef CONFIG_CONTAINER_MEM_CONT + if (mem != NULL && mm->mem_container != mem) { + task_unlock(p); + return 0; + } +#endif + /* * The memory size of the process is the basis for the badness. */ @@ -198,7 +207,8 @@ static inline int constrained_alloc(stru * * (not docbooked, we don't want this one cluttering up the manual) */ -static struct task_struct *select_bad_process(unsigned long *ppoints) +static struct task_struct *select_bad_process(unsigned long *ppoints, + struct mem_container *mem) { struct task_struct *g, *p; struct task_struct *chosen = NULL; @@ -252,7 +262,7 @@ static struct task_struct *select_bad_pr if (p->oomkilladj == OOM_DISABLE) continue; - points = badness(p, uptime.tv_sec); + points = badness(p, uptime.tv_sec, mem); if (points > *ppoints || !chosen) { chosen = p; *ppoints = points; @@ -364,6 +374,30 @@ static int oom_kill_process(struct task_ return oom_kill_task(p); } +#ifdef CONFIG_CONTAINER_MEM_CONT +void mem_container_out_of_memory(struct mem_container *mem) +{ + unsigned long points = 0; + struct task_struct *p; + + container_lock(); + rcu_read_lock(); +retry: + p = select_bad_process(&points, mem); + if (PTR_ERR(p) == -1UL) + goto out; + + if (!p) + p = current; + + if (oom_kill_process(p, points, "Memory container out of memory")) + goto retry; +out: + rcu_read_unlock(); + container_unlock(); +} +#endif + static BLOCKING_NOTIFIER_HEAD(oom_notify_list); int register_oom_notifier(struct notifier_block *nb) @@ -436,7 +470,7 @@ retry: * Rambo mode: Shoot down a process and hope it solves whatever * issues we may have. 
*/ - p = select_bad_process(&points); + p = select_bad_process(&points, NULL); if (PTR_ERR(p) == -1UL) goto out; _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a mes
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote:
> Stefan Richter wrote:
> [...]
> Just use spinlocks if you're not absolutely clear about potential
> races and memory ordering issues -- they're pretty cheap and simple.

I fully agree with this. As Paul Mackerras mentioned elsewhere, a lot of authors sprinkle atomic_t in code thinking they're somehow done with *locking*. This is sad, and I wonder if it's time for a Documentation/atomic-considered-dodgy.txt kind of document :-)

> > Sure, now
> > that I learned of these properties I can start to audit code and insert
> > barriers where I believe they are needed, but this simply means that
> > almost all occurrences of atomic_read will get barriers (unless there
> > already are implicit but more or less obvious barriers like msleep).
>
> You might find that these places that appear to need barriers are
> buggy for other reasons anyway. Can you point to some in-tree code
> we can have a look at?

Such code was mentioned elsewhere (query nodemgr_host_thread in cscope) that managed to escape the requirement for a barrier only because of some completely un-obvious compilation-unit-scope thing. But I find such a non-explicit barrier quite bad taste. Stefan, do consider plunking an explicit call to barrier() there.

Satyam
[-mm PATCH 9/9] Memory controller make page_referenced() container aware (v6)
Make page_referenced() container aware. Without this patch, page_referenced() can cause a page to be skipped while reclaiming pages. This patch ensures that other containers do not hold pages in a particular container hostage. It is required to ensure that shared pages are freed from a container when they are not actively referenced from the container that brought them in Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h |6 ++ include/linux/rmap.h |5 +++-- mm/memcontrol.c|5 + mm/rmap.c | 30 -- mm/vmscan.c|4 ++-- 5 files changed, 40 insertions(+), 10 deletions(-) diff -puN include/linux/rmap.h~mem-control-per-container-page-referenced include/linux/rmap.h --- linux-2.6.23-rc2-mm2/include/linux/rmap.h~mem-control-per-container-page-referenced 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/rmap.h2007-08-17 13:14:20.0 +0530 @@ -8,6 +8,7 @@ #include #include #include +#include /* * The anon_vma heads a list of private "related" vmas, to scan if @@ -86,7 +87,7 @@ static inline void page_dup_rmap(struct /* * Called from mm/vmscan.c to handle paging out */ -int page_referenced(struct page *, int is_locked); +int page_referenced(struct page *, int is_locked, struct mem_container *cnt); int try_to_unmap(struct page *, int ignore_refs); /* @@ -114,7 +115,7 @@ int page_mkclean(struct page *); #define anon_vma_prepare(vma) (0) #define anon_vma_link(vma) do {} while (0) -#define page_referenced(page,l) TestClearPageReferenced(page) +#define page_referenced(page,l,cnt) TestClearPageReferenced(page) #define try_to_unmap(page, refs) SWAP_FAIL static inline int page_mkclean(struct page *page) diff -puN mm/rmap.c~mem-control-per-container-page-referenced mm/rmap.c --- linux-2.6.23-rc2-mm2/mm/rmap.c~mem-control-per-container-page-referenced 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/rmap.c 2007-08-17 13:14:20.0 +0530 @@ -299,7 +299,8 @@ out: return referenced; } -static int page_referenced_anon(struct page *page) +static 
int page_referenced_anon(struct page *page, + struct mem_container *mem_cont) { unsigned int mapcount; struct anon_vma *anon_vma; @@ -312,6 +313,13 @@ static int page_referenced_anon(struct p mapcount = page_mapcount(page); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { + /* +* If we are reclaiming on behalf of a container, skip +* counting on behalf of references from different +* containers +*/ + if (mem_cont && (mm_container(vma->vm_mm) != mem_cont)) + continue; referenced += page_referenced_one(page, vma, &mapcount); if (!mapcount) break; @@ -332,7 +340,8 @@ static int page_referenced_anon(struct p * * This function is only called from page_referenced for object-based pages. */ -static int page_referenced_file(struct page *page) +static int page_referenced_file(struct page *page, + struct mem_container *mem_cont) { unsigned int mapcount; struct address_space *mapping = page->mapping; @@ -365,6 +374,13 @@ static int page_referenced_file(struct p mapcount = page_mapcount(page); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { + /* +* If we are reclaiming on behalf of a container, skip +* counting on behalf of references from different +* containers +*/ + if (mem_cont && (mm_container(vma->vm_mm) != mem_cont)) + continue; if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE)) == (VM_LOCKED|VM_MAYSHARE)) { referenced++; @@ -387,7 +403,8 @@ static int page_referenced_file(struct p * Quick test_and_clear_referenced for all mappings to a page, * returns the number of ptes which referenced the page. 
*/ -int page_referenced(struct page *page, int is_locked) +int page_referenced(struct page *page, int is_locked, + struct mem_container *mem_cont) { int referenced = 0; @@ -399,14 +416,15 @@ int page_referenced(struct page *page, i if (page_mapped(page) && page->mapping) { if (PageAnon(page)) - referenced += page_referenced_anon(page); + referenced += page_referenced_anon(page, mem_cont); else if (is_locked) - referenced += page_referenced_file(page); + referenced += page_referenced_file(page, mem_cont); else if (TestSetPageLocked(page))
[-mm PATCH 8/9] Memory controller add switch to control what type of pages to limit (v6)
Choose if we want cached pages to be accounted or not. By default both are accounted for. A new set of tunables are added. echo -n 1 > mem_control_type switches the accounting to account for only mapped pages echo -n 3 > mem_control_type switches the behaviour back Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h |9 mm/filemap.c |2 mm/memcontrol.c| 92 + mm/swap_state.c|2 4 files changed, 103 insertions(+), 2 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-choose-rss-vs-rss-and-pagecache include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-choose-rss-vs-rss-and-pagecache 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:20.0 +0530 @@ -20,6 +20,8 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H +#include + struct mem_container; struct page_container; @@ -40,6 +42,7 @@ extern unsigned long mem_container_isola struct mem_container *mem_cont, int active); extern void mem_container_out_of_memory(struct mem_container *mem); +extern int mem_container_cache_charge(struct page *page, struct mm_struct *mm); static inline void mem_container_uncharge_page(struct page *page) { @@ -84,6 +87,12 @@ static inline void mem_container_move_li { } +static inline int mem_container_cache_charge(struct page *page, + struct mm_struct *mm) +{ + return 0; +} + #endif /* CONFIG_CONTAINER_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff -puN mm/filemap.c~mem-control-choose-rss-vs-rss-and-pagecache mm/filemap.c --- linux-2.6.23-rc2-mm2/mm/filemap.c~mem-control-choose-rss-vs-rss-and-pagecache 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/filemap.c2007-08-17 13:14:20.0 +0530 @@ -445,7 +445,7 @@ int add_to_page_cache(struct page *page, if (error == 0) { - error = mem_container_charge(page, current->mm); + error = mem_container_cache_charge(page, current->mm); if (error) goto out; diff -puN mm/memcontrol.c~mem-control-choose-rss-vs-rss-and-pagecache 
mm/memcontrol.c --- linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-choose-rss-vs-rss-and-pagecache 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:20.0 +0530 @@ -28,6 +28,8 @@ #include #include +#include + struct container_subsys mem_container_subsys; static const int MEM_CONTAINER_RECLAIM_RETRIES = 5; @@ -59,6 +61,7 @@ struct mem_container { * spin_lock to protect the per container LRU */ spinlock_t lru_lock; + unsigned long control_type; /* control RSS or RSS+Pagecache */ }; /* @@ -81,6 +84,15 @@ struct page_container { /* mapped and cached states */ }; +enum { + MEM_CONTAINER_TYPE_UNSPEC = 0, + MEM_CONTAINER_TYPE_MAPPED, + MEM_CONTAINER_TYPE_CACHED, + MEM_CONTAINER_TYPE_ALL, + MEM_CONTAINER_TYPE_MAX, +} mem_control_type; + +static struct mem_container init_mem_container; static inline struct mem_container *mem_container_from_cont(struct container *cont) @@ -361,6 +373,22 @@ err: } /* + * See if the cached pages should be charged at all? + */ +int mem_container_cache_charge(struct page *page, struct mm_struct *mm) +{ + struct mem_container *mem; + if (!mm) + mm = &init_mm; + + mem = rcu_dereference(mm->mem_container); + if (mem->control_type == MEM_CONTAINER_TYPE_ALL) + return mem_container_charge(page, mm); + else + return 0; +} + +/* * Uncharging is always a welcome operation, we never complain, simply * uncharge. */ @@ -370,6 +398,10 @@ void mem_container_uncharge(struct page_ struct page *page; unsigned long flags; + /* +* This can handle cases when a page is not charged at all and we +* are switching between handling the control_type. 
+*/ if (!pc) return; @@ -405,6 +437,60 @@ static ssize_t mem_container_write(struc cft->private, userbuf, nbytes, ppos); } +static ssize_t mem_control_type_write(struct container *cont, + struct cftype *cft, struct file *file, + const char __user *userbuf, + size_t nbytes, loff_t *pos) +{ + int ret; + char *buf, *end; + unsigned long tmp; + struct mem_container *mem; + + mem = mem_container_from_cont(cont); + buf = kmalloc(nbytes + 1, GFP_KERNEL); + ret = -ENOMEM; + if (buf == NULL) + goto out; + +
[-mm PATCH 6/9] Memory controller add per container LRU and reclaim (v6)
Changelog since v3 1. Added reclaim retry logic to avoid being OOM'ed due to pages from swap cache (coming in due to reclaim) don't overwhelm the container. Changelog 1. Fix probable NULL pointer dereference based on review comments by YAMAMOTO Takashi Add the page_container to the per container LRU. The reclaim algorithm has been modified to make the isolate_lru_pages() as a pluggable component. The scan_control data structure now accepts the container on behalf of which reclaims are carried out. try_to_free_pages() has been extended to become container aware. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h | 12 +++ include/linux/res_counter.h | 23 +++ include/linux/swap.h|3 mm/memcontrol.c | 135 +++- mm/swap.c |2 mm/vmscan.c | 126 - 6 files changed, 275 insertions(+), 26 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-lru-and-reclaim include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:20.0 +0530 @@ -32,6 +32,13 @@ extern void page_assign_page_container(s extern struct page_container *page_get_page_container(struct page *page); extern int mem_container_charge(struct page *page, struct mm_struct *mm); extern void mem_container_uncharge(struct page_container *pc); +extern void mem_container_move_lists(struct page_container *pc, bool active); +extern unsigned long mem_container_isolate_pages(unsigned long nr_to_scan, + struct list_head *dst, + unsigned long *scanned, int order, + int mode, struct zone *z, + struct mem_container *mem_cont, + int active); static inline void mem_container_uncharge_page(struct page *page) { @@ -71,6 +78,11 @@ static inline void mem_container_uncharg { } +static inline void mem_container_move_lists(struct page_container *pc, + bool active) +{ +} + #endif /* CONFIG_CONTAINER_MEM_CONT */ #endif /* 
_LINUX_MEMCONTROL_H */ diff -puN include/linux/res_counter.h~mem-control-lru-and-reclaim include/linux/res_counter.h --- linux-2.6.23-rc2-mm2/include/linux/res_counter.h~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/res_counter.h 2007-08-17 13:14:20.0 +0530 @@ -99,4 +99,27 @@ int res_counter_charge(struct res_counte void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); void res_counter_uncharge(struct res_counter *counter, unsigned long val); +static inline bool res_counter_limit_check_locked(struct res_counter *cnt) +{ + if (cnt->usage < cnt->limit) + return true; + + return false; +} + +/* + * Helper function to detect if the container is within it's limit or + * not. It's currently called from container_rss_prepare() + */ +static inline bool res_counter_check_under_limit(struct res_counter *cnt) +{ + bool ret; + unsigned long flags; + + spin_lock_irqsave(&cnt->lock, flags); + ret = res_counter_limit_check_locked(cnt); + spin_unlock_irqrestore(&cnt->lock, flags); + return ret; +} + #endif diff -puN include/linux/swap.h~mem-control-lru-and-reclaim include/linux/swap.h --- linux-2.6.23-rc2-mm2/include/linux/swap.h~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/swap.h2007-08-17 13:14:20.0 +0530 @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -191,6 +192,8 @@ extern void swap_setup(void); /* linux/mm/vmscan.c */ extern unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask); +extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem); +extern int __isolate_lru_page(struct page *page, int mode); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); diff -puN mm/memcontrol.c~mem-control-lru-and-reclaim mm/memcontrol.c --- 
linux-2.6.23-rc2-mm2/mm/memcontrol.c~mem-control-lru-and-reclaim 2007-08-17 13:14:20.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/mm/memcontrol.c 2007-08-17 13:14:20.0 +0530 @@ -24,8 +24,12 @@ #include #include #include +#include +#include +#include struct container_subsys mem_container_subsys; +static const int MEM_CONTAINER_RECLAIM_RETRIES = 5; /* * The memory co
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Friday 17 August 2007 05:42, Linus Torvalds wrote:
> On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > I'm really surprised it's as much as a few K. I tried it on powerpc
> > and it only saved 40 bytes (10 instructions) for a G5 config.
>
> One of the things that "volatile" generally screws up is a simple
>
> volatile int i;
>
> i++;

But for atomic_t people use atomic_inc() anyway, which does this correctly. It shouldn't really matter for atomic_t.

I'm worrying a bit that the volatile atomic_t change caused subtle code breakage, like these delay-read loops people here pointed out. Wouldn't it be safer to just re-add the volatile to atomic_read() for 2.6.23? Or alternatively make it asm(), but volatile seems more proven.

-Andi
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote:
> Satyam Sharma wrote:
> [...]
> > Granted, the above IS buggy code. But, the stated objective is to avoid
> > heisenbugs.
>
> Anyway, why are you making up code snippets that are buggy in other
> ways in order to support this assertion being made that lots of kernel
> code supposedly depends on volatile semantics. Just reference the
> actual code.

Because the point is *not* about existing bugs in kernel code. At some point Chris Snook (who started this thread) did write that "If I knew of the existing bugs in the kernel, I would be sending patches for them, not this series" or something to that effect. The point is about *author expectations*. If people do expect atomic_read() (or a variant thereof) to have volatile semantics, why not give them such a variant?

And by the way, the point is *also* about the fact that cpu_relax(), as of today, implies a full memory clobber, which is not what a lot of such loops want. (due to stuff mentioned elsewhere, summarized in that summary)

> > And we have driver / subsystem maintainers such as Stefan
> > coming up and admitting that often a lot of code that's written to use
> > atomic_read() does assume the read will not be elided by the compiler.

(so it's about compiler barrier expectations only, though I fully agree that those who're using atomic_t as if it were some magic thing that lets them write lockless code are sorely mistaken.)

> So these are broken on i386 and x86-64?

Possibly, but the point is not about existing bugs, as mentioned above. Some such bugs have been found nonetheless -- reminds me, can somebody please apply http://www.gossamer-threads.com/lists/linux/kernel/810674 ?

Satyam
Re: Fork Bombing Patch
I agree that (1) one risks overdoing comments and (2) one should minimize comments inside a function body. But I read your earlier statement:

	Please do not add comments inside functions.

as simply requesting no comments inside function bodies, without exception. That seems to me to be too strong a statement. The "try to avoid" in the CodingStyle I read as meaning roughly "minimize", which is consistent with what it further states, allowing for "small comments" in certain cases.

Admittedly, I probably agree with you that the comment that Anand added in his patch was one of those comments that didn't add enough value to be justified. I was just quibbling over the wording of your objections to it.

... I'll shut up now.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Rusty Russell wrote:
> On Fri, 2007-08-17 at 09:35 +0200, Laurent Vivier wrote:
>> Rusty Russell wrote:
>>> Hi Laurent,
>>
>> Hi Rusty,
>> how are your puppies ?
>
> They're getting a little fat, actually. Too many features ...
>
>> - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate
>> guest time (by calling something like guest_enter() and guest_exit() from the
>> virtualization engine), and when in account_system_time() we have cputime >
>> vtime we subtract vtime from cputime and add vtime to user time and guest
>> time.
>> But doing like this we freeze in kernel/sched.c the link between system time,
>> user time and guest time (i.e. system time = system time - vtime, user time =
>> user time + vtime and guest time = guest time + vtime).
>
> Actually, I think we can set a per-cpu "in_guest" flag for the scheduler
> code, which then knows to add the tick to the guest time. That seems
> the simplest possible solution.
>
> lguest or kvm would set the flag before running the guest (which is done
> with preempt disabled or using preemption hooks), and reset it
> afterwards.
>
> Thoughts?

It was my first attempt (except I didn't have a per-cpu flag, but a per-task flag); it's not visible, but I love simplicity... ;-)

A KVM VCPU is stopped by preemption, so when we enter the scheduler we have exited from the VCPU and thus this flag is off (so we account 0 to the guest). What I did then is set the flag on when we enter the VCPU, and account_system_time() sets the flag off when it adds this timeslice to cpustat (and correctly computes guest, user and system time). But I didn't like this idea, because all code executed after we entered the VCPU is accounted to the guest until we reach an account_system_time(), and I suppose we can have real system time in this part. And I guess a VCPU can run for less than 1 ms (the unit of cputime) in a timeslice.

So? What's best?
Laurent
--
- [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:
> On Fri, 17 Aug 2007, Nick Piggin wrote:
> > > Sure, now that I learned of these properties I can start to audit code
> > > and insert barriers where I believe they are needed, but this simply
> > > means that almost all occurrences of atomic_read will get barriers
> > > (unless there already are implicit but more or less obvious barriers
> > > like msleep).
> >
> > You might find that these places that appear to need barriers are buggy
> > for other reasons anyway. Can you point to some in-tree code we can have
> > a look at?
>
> Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
> that managed to escape the requirement for a barrier only because of some
> completely un-obvious compilation-unit-scope thing. But I find such a
> non-explicit barrier quite bad taste. Stefan, do consider plunking an
> explicit call to barrier() there.

It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier. The "unobvious" thing is that you wanted to know how the compiler knows a function is a barrier -- the answer is that if it does not *know* it is not a barrier, it must assume it is a barrier. If the whole msleep call chain including the scheduler were defined static in the current compilation unit, then it would still be a barrier, because it would actually be able to see the barriers in schedule(void), if nothing else.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Also, why would you want to make these insane accessors for atomic_t types? Just make sure everybody knows the basics of barriers, and they can apply that knowledge to atomic_t and all other lockless memory accesses as well. Code that looks like: while (!atomic_read(&v)) { ... cpu_relax_no_barrier(); forget(v.counter); } would be uglier. Also think about code such as: I think they would both be equally ugly, but the atomic_read_volatile variant would be more prone to subtle bugs because of the weird implementation. And it would be more ugly than introducing an order(x) statement for all memory operations, and adding an order_atomic() wrapper for it for atomic types. a = atomic_read(); if (!a) do_something(); forget(); a = atomic_read(); ... /* some code that depends on value of a, obviously */ forget(); a = atomic_read(); ... So much explicit sprinkling of "forget()" looks ugly. Firstly, why is it ugly? It's nice because of those nice explicit statements there that give us a good heads up and would have some comments attached to them (also, lack of the word "volatile" is always a plus). Secondly, what sort of code would do such a thing? In most cases, it is probably riddled with bugs anyway (unless it is doing a really specific sequence of interrupts or something, but in that case it is very likely to either require locking or busy waits anyway -> ie. barriers). on the other hand, looks neater. The "_volatile()" suffix makes it also no less explicit than an explicit barrier-like macro that this primitive is something "special", for code clarity purposes. Just don't use the word volatile, and have barriers both before and after the memory operation, and I'm OK with it. I don't see the point though, when you could just have a single barrier(x) barrier function defined for all memory locations, rather than this odd thing that only works for atomics (and would have to be duplicated for atomic_set. 
[-mm PATCH 3/9] Memory controller accounting setup (v6)
From: Pavel Emelianov <[EMAIL PROTECTED]> Changelog for v5 1. Remove inclusion of memcontrol.h from mm_types.h Changelog As per Paul's review comments 1. Drop css_get() for the root memory container 2. Use mem_container_from_task() as an optimization instead of using mem_container_from_cont() along with task_container. Basic setup routines, the mm_struct has a pointer to the container that it belongs to and the the page has a page_container associated with it. Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]> Signed-off-by: <[EMAIL PROTECTED]> --- include/linux/memcontrol.h | 36 include/linux/mm_types.h |6 include/linux/sched.h |1 kernel/fork.c | 11 ++-- mm/memcontrol.c| 57 + 5 files changed, 104 insertions(+), 7 deletions(-) diff -puN include/linux/memcontrol.h~mem-control-accounting-setup include/linux/memcontrol.h --- linux-2.6.23-rc2-mm2/include/linux/memcontrol.h~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/memcontrol.h 2007-08-17 13:14:19.0 +0530 @@ -3,6 +3,9 @@ * Copyright IBM Corporation, 2007 * Author Balbir Singh <[EMAIL PROTECTED]> * + * Copyright 2007 OpenVZ SWsoft Inc + * Author: Pavel Emelianov <[EMAIL PROTECTED]> + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -17,5 +20,38 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H +struct mem_container; +struct page_container; + +#ifdef CONFIG_CONTAINER_MEM_CONT + +extern void mm_init_container(struct mm_struct *mm, struct task_struct *p); +extern void mm_free_container(struct mm_struct *mm); +extern void page_assign_page_container(struct page *page, + struct page_container *pc); +extern struct page_container *page_get_page_container(struct page *page); + +#else /* CONFIG_CONTAINER_MEM_CONT */ +static inline void mm_init_container(struct mm_struct *mm, + struct task_struct *p) 
+{ +} + +static inline void mm_free_container(struct mm_struct *mm) +{ +} + +static inline void page_assign_page_container(struct page *page, + struct page_container *pc) +{ +} + +static inline struct page_container *page_get_page_container(struct page *page) +{ + return NULL; +} + +#endif /* CONFIG_CONTAINER_MEM_CONT */ + #endif /* _LINUX_MEMCONTROL_H */ diff -puN include/linux/mm_types.h~mem-control-accounting-setup include/linux/mm_types.h --- linux-2.6.23-rc2-mm2/include/linux/mm_types.h~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/mm_types.h2007-08-17 13:14:19.0 +0530 @@ -95,6 +95,9 @@ struct page { unsigned int gfp_mask; unsigned long trace[8]; #endif +#ifdef CONFIG_CONTAINER_MEM_CONT + unsigned long page_container; +#endif }; /* @@ -226,6 +229,9 @@ struct mm_struct { /* aio bits */ rwlock_tioctx_list_lock; struct kioctx *ioctx_list; +#ifdef CONFIG_CONTAINER_MEM_CONT + struct mem_container *mem_container; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff -puN include/linux/sched.h~mem-control-accounting-setup include/linux/sched.h --- linux-2.6.23-rc2-mm2/include/linux/sched.h~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/include/linux/sched.h 2007-08-17 13:14:19.0 +0530 @@ -87,6 +87,7 @@ struct sched_param { #include +struct mem_container; struct exec_domain; struct futex_pi_state; struct bio; diff -puN kernel/fork.c~mem-control-accounting-setup kernel/fork.c --- linux-2.6.23-rc2-mm2/kernel/fork.c~mem-control-accounting-setup 2007-08-17 13:14:19.0 +0530 +++ linux-2.6.23-rc2-mm2-balbir/kernel/fork.c 2007-08-17 13:14:19.0 +0530 @@ -50,6 +50,7 @@ #include #include #include +#include #include #include @@ -328,7 +329,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLO #include -static struct mm_struct * mm_init(struct mm_struct * mm) +static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p) { atomic_set(&mm->mm_users, 1); atomic_set(&mm->mm_count, 
1); @@ -345,11 +346,14 @@ static struct mm_struct * mm_init(struct mm->ioctx_list = NULL; mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; + mm_init_container(mm, p); if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; return mm; } + + mm_free_container(mm); free_mm(mm); return NULL; } @@ -364,7 +368,7 @@ struct mm_struct * mm_alloc(void)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: [...] Granted, the above IS buggy code. But, the stated objective is to avoid heisenbugs. ^^ Anyway, why are you making up code snippets that are buggy in other ways in order to support this assertion being made that lots of kernel code supposedly depends on volatile semantics. Just reference the actual code. Because the point is *not* about existing bugs in kernel code. At some point Chris Snook (who started this thread) did write that "If I knew of the existing bugs in the kernel, I would be sending patches for them, not this series" or something to that effect. The point is about *author expecations*. If people do expect atomic_read() (or a variant thereof) to have volatile semantics, why not give them such a variant? Because they should be thinking about them in terms of barriers, over which the compiler / CPU is not to reorder accesses or cache memory operations, rather than "special" "volatile" accesses. Linux's whole memory ordering and locking model is completely geared around the former. And by the way, the point is *also* about the fact that cpu_relax(), as of today, implies a full memory clobber, which is not what a lot of such loops want. (due to stuff mentioned elsewhere, summarized in that summary) That's not the point, because as I also mentioned, the logical extention to Linux's barrier API to handle this is the order(x) macro. Again, not special volatile accessors. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix the return value of sys_set_tid_address()
This call should return the virtual pid to the caller, just like sys_getpid()/sys_gettid() do, not the global one. Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]> Cc: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Cc: Oleg Nesterov <[EMAIL PROTECTED]> --- kernel/fork.c |2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/fork.c b/kernel/fork.c index 645c614..d7cff48 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -938,7 +938,7 @@ asmlinkage long sys_set_tid_address(int { current->clear_child_tid = tidptr; - return current->pid; + return task_pid_vnr(current); } static inline void rt_mutex_init_task(struct task_struct *p)
[PATCH] Use find_task_by_pid_ns() in places that operate with virtual pids
When the pid comes from userspace, find_task_by_pid_ns() should be used to find the task by pid in a particular (usually the current) namespace. These places were lost in earlier patches. Think over: all these places work like this: if (pid == 0) task = current; else task = find_task_by_pid_ns(pid); the question is: is it worth introducing a common helper for such a case and (if it is) what should its name be? Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]> Cc: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Cc: Oleg Nesterov <[EMAIL PROTECTED]> --- These are lost hunks of the -mm patch named pid-namespaces-changes-to-show-virtual-ids-to-user.patch fs/ioprio.c |6 -- kernel/futex.c|6 -- kernel/futex_compat.c |3 ++- kernel/sched.c|3 ++- mm/mempolicy.c|3 ++- mm/migrate.c |3 ++- 6 files changed, 16 insertions(+), 8 deletions(-) diff --git a/fs/ioprio.c b/fs/ioprio.c index c99fe7b..0a615f8 100644 --- a/fs/ioprio.c +++ b/fs/ioprio.c @@ -94,7 +94,8 @@ asmlinkage long sys_ioprio_set(int which if (!who) p = current; else - p = find_task_by_pid(who); + p = find_task_by_pid_ns(who, + current->nsproxy->pid_ns); if (p) ret = set_task_ioprio(p, ioprio); break; @@ -181,7 +182,8 @@ asmlinkage long sys_ioprio_get(int which if (!who) p = current; else - p = find_task_by_pid(who); + p = find_task_by_pid_ns(who, + current->nsproxy->pid_ns); if (p) ret = get_task_ioprio(p); break; diff --git a/kernel/futex.c b/kernel/futex.c index c61634a..0c1b777 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -444,7 +444,8 @@ static struct task_struct * futex_find_g struct task_struct *p; rcu_read_lock(); - p = find_task_by_pid(pid); + p = find_task_by_pid_ns(pid, + current->nsproxy->pid_ns); if (!p || ((current->euid != p->euid) && (current->euid != p->uid))) p = ERR_PTR(-ESRCH); @@ -1855,7 +1856,8 @@ sys_get_robust_list(int pid, struct robu ret = -ESRCH; rcu_read_lock(); - p = find_task_by_pid(pid); + p = find_task_by_pid_ns(pid, + 
current->nsproxy->pid_ns); if (!p) goto err_unlock; ret = -EPERM; diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c index f792136..e7563bf 100644 --- a/kernel/futex_compat.c +++ b/kernel/futex_compat.c @@ -116,7 +116,8 @@ compat_sys_get_robust_list(int pid, comp ret = -ESRCH; read_lock(&tasklist_lock); - p = find_task_by_pid(pid); + p = find_task_by_pid_ns(pid, + current->nsproxy->pid_ns); if (!p) goto err_unlock; ret = -EPERM; diff --git a/kernel/sched.c b/kernel/sched.c index 4359b6b..1f81803 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4108,7 +4108,8 @@ struct task_struct *idle_task(int cpu) */ static inline struct task_struct *find_process_by_pid(pid_t pid) { - return pid ? find_task_by_pid(pid) : current; + return pid ? + find_task_by_pid_ns(pid, current->nsproxy->pid_ns) : current; } /* Actually do priority change: must hold rq lock. */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index e7eb2bb..934abc9 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -930,7 +930,8 @@ asmlinkage long sys_migrate_pages(pid_t /* Find the mm_struct */ read_lock(&tasklist_lock); - task = pid ? find_task_by_pid(pid) : current; + task = pid ? + find_task_by_pid_ns(pid, current->nsproxy->pid_ns) : current; if (!task) { read_unlock(&tasklist_lock); return -ESRCH; diff --git a/mm/migrate.c b/mm/migrate.c index 477b894..7f43c5f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -917,7 +917,8 @@ asmlinkage long sys_move_pages(pid_t pid /* Find the mm_struct */ read_lock(&tasklist_lock); - task = pid ? find_task_by_pid(pid) : current; + task = pid ? + find_task_by_pid_ns(pid, current->nsproxy->pid_ns) : current; if (!task) { read_unlock(&tasklist_lock); return -ESRCH; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of
[RFC][PATCH] maps: show swapped out pages in smaps
Show the amount of swapped out pages in /proc/<pid>/smaps. Currently there's no way to know who is using the swap file. A possibly better way to support it is to add a new counter to struct mm_struct. Or maybe it's not that important at all? Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]> --- fs/proc/task_mmu.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) --- linux-2.6.23-rc2-mm2.orig/fs/proc/task_mmu.c +++ linux-2.6.23-rc2-mm2/fs/proc/task_mmu.c @@ -324,6 +324,7 @@ struct mem_size_stats unsigned long private_clean; unsigned long private_dirty; unsigned long referenced; + unsigned long swapped; /* * Proportional Set Size(PSS): my share of RSS. @@ -367,8 +368,11 @@ static int smaps_pte_range(pmd_t *pmd, u pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; - if (!pte_present(ptent)) + if (!pte_present(ptent)) { + if (!pte_none(ptent) && !pte_file(ptent)) + mss->swapped += PAGE_SIZE; continue; + } mss->resident += PAGE_SIZE; @@ -425,7 +429,8 @@ static int show_smap(struct seq_file *m, "Shared_Dirty: %8lu kB\n" "Private_Clean: %8lu kB\n" "Private_Dirty: %8lu kB\n" - "Referenced: %8lu kB\n", + "Referenced: %8lu kB\n" + "Swapped:%8lu kB\n", (vma->vm_end - vma->vm_start) >> 10, sarg.mss.resident >> 10, (unsigned long)(sarg.mss.pss >> (10 + PSS_ERROR_BITS)), @@ -433,7 +438,8 @@ static int show_smap(struct seq_file *m, sarg.mss.shared_dirty >> 10, sarg.mss.private_clean >> 10, sarg.mss.private_dirty >> 10, - sarg.mss.referenced >> 10); + sarg.mss.referenced>> 10, + sarg.mss.swapped >> 10); return ret; }
[PATCH 2.6.22.3] ppp: fix output buffer size in ppp_decompress_frame
This patch addresses the issue with "osize too small" errors in mppe encryption. The patch fixes the wrong output buffer size being passed to the ppp decompression routine. Signed-off-by: Konstantin Sharlaimov <[EMAIL PROTECTED]> --- As pointed out by Suresh Mahalingam, the issue addressed by the ppp-fix-osize-too-small-errors-when-decoding patch is not fully resolved yet. The size of the allocated output buffer is correct; however, the size passed to ppp->rcomp->decompress in ppp_generic.c is wrong. The patch fixes that. --- linux-2.6.21.3/drivers/net/ppp_generic.c.orig 2007-08-17 20:35:03.0 +1100 +++ linux-2.6.21.3/drivers/net/ppp_generic.c2007-08-17 20:35:45.0 +1100 @@ -1726,7 +1726,7 @@ ppp_decompress_frame(struct ppp *ppp, st } /* the decompressor still expects the A/C bytes in the hdr */ len = ppp->rcomp->decompress(ppp->rc_state, skb->data - 2, - skb->len + 2, ns->data, ppp->mru + PPP_HDRLEN); + skb->len + 2, ns->data, obuff_size); if (len < 0) { /* Pass the compressed frame to pppd as an error indication. */
Re: [PATCH] Smack: Simplified Mandatory Access Control Kernel
Btw, at: diff -uprN -X linux-2.6.22-base/Documentation/dontdiff linux-2.6.22-base/security/smack/Kconfig linux-2.6.22/security/smack/Kconfig --- linux-2.6.22-base/security/smack/Kconfig1969-12-31 16:00:00.0 -0800 +++ linux-2.6.22/security/smack/Kconfig 2007-07-10 01:08:05.0 -0700 @@ -0,0 +1,10 @@ +config SECURITY_SMACK + bool "Simplified Mandatory Access Control Kernel Support" + depends on NETLABEL && SECURITY_NETWORK + default n + help + This selects the Simplified Mandatory Access Control Kernel. + SMACK is useful for sensitivity, integrity, and a variety + of other madatory security schemes. + If you are unsure how to answer this question, answer N. + change: + of other madatory security schemes. to: + of other mandatory security schemes. -- Miguel Ojeda http://maxextreme.googlepages.com/index.htm - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma writes: > I wonder if this'll generate smaller and better code than _both_ the > other atomic_read_volatile() variants. Would need to build allyesconfig > on lots of diff arch's etc to test the theory though. I'm sure it would be a tiny effect. This whole thread is arguing about effects that are quite insignificant. On the one hand we have the non-volatile proponents, who want to let the compiler do extra optimizations - which amounts to letting it elide maybe a dozen loads in the whole kernel, loads which would almost always be L1 cache hits. On the other hand we have the volatile proponents, who are concerned that some code somewhere in the kernel might be buggy without the volatile behaviour, and who also want to be able to remove some barriers and thus save a few bytes of code and a few loads here and there (and possibly some stores too). Either way the effect on code size and execution time is miniscule. In the end the strongest argument is actually that gcc generates unnecessarily verbose code on x86[-64] for volatile accesses. Even then we're only talking about ~2000 bytes, or less than 1 byte per instance of atomic_read on average, about 0.06% of the kernel text size. The x86[-64] developers seem to be willing to bear the debugging cost involved in having the non-volatile behaviour for atomic_read, in order to save the 2kB. That's fine with me. Either way I think somebody should audit all the uses of atomic_read, not just for missing barriers, but also to find the places where it's used in a racy manner. Then we can work out where the races matter and fix them if they do. Paul. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > > > Sure, now > > > > that I learned of these properties I can start to audit code and insert > > > > barriers where I believe they are needed, but this simply means that > > > > almost all occurrences of atomic_read will get barriers (unless there > > > > already are implicit but more or less obvious barriers like msleep). > > > > > > You might find that these places that appear to need barriers are > > > buggy for other reasons anyway. Can you point to some in-tree code > > > we can have a look at? > > > > > > Such code was mentioned elsewhere (query nodemgr_host_thread in cscope) > > that managed to escape the requirement for a barrier only because of > > some completely un-obvious compilation-unit-scope thing. But I find such > > an non-explicit barrier quite bad taste. Stefan, do consider plunking an > > explicit call to barrier() there. > > It is very obvious. msleep calls schedule() (ie. sleeps), which is > always a barrier. Probably you didn't mean that, but no, schedule() is not a barrier because it sleeps. It's a barrier because it's invisible. > The "unobvious" thing is that you wanted to know how the compiler knows > a function is a barrier -- answer is that if it does not *know* it is not > a barrier, it must assume it is a barrier. True, that's clearly what happens here. But you're definitely joking that this is "obvious" in terms of code clarity, right? Just 5 minutes back you mentioned elsewhere you like seeing lots of explicit calls to barrier() (with comments, no less, hmm? :-)
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Andi Kleen wrote: > On Friday 17 August 2007 05:42, Linus Torvalds wrote: > > On Fri, 17 Aug 2007, Paul Mackerras wrote: > > > I'm really surprised it's as much as a few K. I tried it on powerpc > > > and it only saved 40 bytes (10 instructions) for a G5 config. > > > > One of the things that "volatile" generally screws up is a simple > > > > volatile int i; > > > > i++; > > But for atomic_t people use atomic_inc() anyways which does this correctly. > It shouldn't really matter for atomic_t. > > I'm worrying a bit that the volatile atomic_t change caused subtle code > breakage like these delay read loops people here pointed out. Umm, I followed most of the thread, but which breakage is this? > Wouldn't it be safer to just re-add the volatile to atomic_read() > for 2.6.23? Or alternatively make it asm(), but volatile seems more > proven. The problem with volatile is not just trashy code generation (which also definitely is a major problem), but definition holes, and implementation inconsistencies. Making it asm() is not the only other alternative to volatile either (read another reply to this mail), but considering most of the thread has been about people not wanting even a atomic_read_volatile() variant, making atomic_read() itself have volatile semantics sounds ... strange :-) PS: http://lkml.org/lkml/2007/8/15/407 was submitted a couple days back, any word if you saw that? I have another one for you: [PATCH] i386, x86_64: __const_udelay() should not be marked inline Because it can never get inlined in any callsite (each translation unit is compiled separately for the kernel and so the implementation of __const_udelay() would be invisible to all other callsites). 
In fact it turns out, the correctness of callsites at arch/x86_64/kernel/crash.c:97 and arch/i386/kernel/crash.c:101 explicitly _depends_ upon it not being inlined, and also it's an exported symbol (modules may want to call mdelay() and udelay() that often becomes __const_udelay() after some macro-ing in various headers). So let's not mark it as "inline" either. Signed-off-by: Satyam Sharma <[EMAIL PROTECTED]> --- arch/i386/lib/delay.c |2 +- arch/x86_64/lib/delay.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/i386/lib/delay.c b/arch/i386/lib/delay.c index f6edb11..0082c99 100644 --- a/arch/i386/lib/delay.c +++ b/arch/i386/lib/delay.c @@ -74,7 +74,7 @@ void __delay(unsigned long loops) delay_fn(loops); } -inline void __const_udelay(unsigned long xloops) +void __const_udelay(unsigned long xloops) { int d0; diff --git a/arch/x86_64/lib/delay.c b/arch/x86_64/lib/delay.c index 2dbebd3..d0cd9cd 100644 --- a/arch/x86_64/lib/delay.c +++ b/arch/x86_64/lib/delay.c @@ -38,7 +38,7 @@ void __delay(unsigned long loops) } EXPORT_SYMBOL(__delay); -inline void __const_udelay(unsigned long xloops) +void __const_udelay(unsigned long xloops) { __delay(((xloops * HZ * cpu_data[raw_smp_processor_id()].loops_per_jiffy) >> 32) + 1); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix check for return value of create_pid_namespace
The function in question returns ERR_PTR-s to indicate the error, while the caller checks for return value to be NULL. Found during testing the OpenVZ kernel with pid namespaces. Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]> Cc: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Cc: Oleg Nesterov <[EMAIL PROTECTED]> --- kernel/pid.c |2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/pid.c b/kernel/pid.c index faad017..cb1f3cb 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -543,7 +543,7 @@ struct pid_namespace *copy_pid_ns(unsign goto out_put; new_ns = create_pid_namespace(old_ns->level + 1); - if (new_ns != NULL) + if (!IS_ERR(new_ns)) new_ns->parent = get_pid_ns(old_ns); out_put: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > > Also, why would you want to make these insane accessors for atomic_t > > > types? Just make sure everybody knows the basics of barriers, and they > > > can apply that knowledge to atomic_t and all other lockless memory > > > accesses as well. > > > > > > Code that looks like: > > > > while (!atomic_read(&v)) { > > ... > > cpu_relax_no_barrier(); > > forget(v.counter); > > > > } > > > > would be uglier. Also think about code such as: > > I think they would both be equally ugly, You think both these are equivalent in terms of "looks": | while (!atomic_read(&v)) { | while (!atomic_read_xxx(&v)) { ... | ... cpu_relax_no_barrier(); | cpu_relax_no_barrier(); order_atomic(&v); | } } | (where order_atomic() is an atomic_t specific wrapper as you mentioned below) ? Well, taste varies, but ... > but the atomic_read_volatile > variant would be more prone to subtle bugs because of the weird > implementation. What bugs? > And it would be more ugly than introducing an order(x) statement for > all memory operations, and adding an order_atomic() wrapper for it > for atomic types. Oh, that order() / forget() macro [forget() was named such by Chuck Ebbert earlier in this thread where he first mentioned it, btw] could definitely be generically introduced for any memory operations. > > a = atomic_read(); > > if (!a) > > do_something(); > > > > forget(); > > a = atomic_read(); > > ... /* some code that depends on value of a, obviously */ > > > > forget(); > > a = atomic_read(); > > ... > > > > So much explicit sprinkling of "forget()" looks ugly. > > Firstly, why is it ugly? It's nice because of those nice explicit > statements there that give us a good heads up and would have some > comments attached to them atomic_read_xxx (where xxx = whatever naming sounds nice to you) would obviously also give a heads up, and could also have some comments attached to it. 
> (also, lack of the word "volatile" is always a plus). Ok, xxx != volatile. > Secondly, what sort of code would do such a thing? See the nodemgr_host_thread() that does something similar, though not exactly same. > > on the other hand, looks neater. The "_volatile()" suffix makes it also > > no less explicit than an explicit barrier-like macro that this primitive > > is something "special", for code clarity purposes. > > Just don't use the word volatile, That sounds amazingly frivolous, but hey, why not. As I said, ok, xxx != volatile. > and have barriers both before and after the memory operation, How could that lead to bugs? (if you can point to existing code, but just some testcase / sample code would be fine as well). > [...] I don't see > the point though, when you could just have a single barrier(x) > barrier function defined for all memory locations, As I said, barrier() is too heavy-handed. > rather than > this odd thing that only works for atomics Why would it work only for atomics? You could use that generic macro for anything you well damn please. > (and would have to > be duplicated for atomic_set. #define atomic_set_xxx for something similar. Big deal ... NOT. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Paul Mackerras wrote: > Satyam Sharma writes: > > > I wonder if this'll generate smaller and better code than _both_ the > > other atomic_read_volatile() variants. Would need to build allyesconfig > > on lots of diff arch's etc to test the theory though. > > I'm sure it would be a tiny effect. > > This whole thread is arguing about effects that are quite > insignificant. Hmm, the fact that this thread became what it did, probably means that most developers on this list do not mind thinking/arguing about effects or optimizations that are otherwise "tiny". But yeah, they are tiny nonetheless. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nf_conntrack_ipv4 must be loaded explicitly
On Aug 2 2007 20:33, Patrick McHardy wrote: >> End result: >> >> After loading nf_conntrack_ipv4.ko, everything works again (also with the >> "bad" ff09b7). But I have to load it explicitly, and I think that >> unfortunately breaks a lot of setups (such as mine) which assume ipv4 >> connection tracking is always there. > > I already have a patch for this queued. I'll push it upstream once > I get a new power supply for the box I keep that tree on, hopefully > tommorrow. What's that patch looking like? thanks, Jan -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > [...] > > The point is about *author expecations*. If people do expect atomic_read() > > (or a variant thereof) to have volatile semantics, why not give them such > > a variant? > > Because they should be thinking about them in terms of barriers, over > which the compiler / CPU is not to reorder accesses or cache memory > operations, rather than "special" "volatile" accesses. This is obviously just a taste thing. Whether to have that forget(x) barrier as something author should explicitly sprinkle appropriately in appropriate places in the code by himself or use a primitive that includes it itself. I'm not saying "taste matters aren't important" (they are), but I'm really skeptical if most folks would find the former tasteful. > > And by the way, the point is *also* about the fact that cpu_relax(), as > > of today, implies a full memory clobber, which is not what a lot of such > > loops want. (due to stuff mentioned elsewhere, summarized in that summary) > > That's not the point, That's definitely the point, why not. This is why "barrier()", being heavy-handed, is not the best option. > because as I also mentioned, the logical extention > to Linux's barrier API to handle this is the order(x) macro. Again, not > special volatile accessors. Sure, that forget(x) macro _is_ proposed to be made part of the generic API. Doesn't explain why not to define/use primitives that has volatility semantics in itself, though (taste matters apart). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
recursive use of bust_spinlocks()
Various architectures may call bust_spinlocks() recursively (when calling die() in the context of an unresolved page fault); the function itself, however, doesn't appear to be meant to be called in this manner. Nevertheless, this doesn't appear to be a problem as long as bust_spinlocks(0) doesn't get called twice in a row (otherwise, unblank_screen() may enter the scheduler). However, at least on i386 die() has been capable of returning (and on other architectures this should really be that way, too) when notify_die() returns NOTIFY_STOP. The question now is: Should bust_spinlocks() increment/decrement oops_in_progress (and kick klogd only when the count drops to zero), or should we just avoid entering the scheduler by forcibly setting oops_in_progress to 1 prior to calling unblank_screen(), or should all architectures currently doing so avoid calling bust_spinlocks() recursively? Further, many (if not all) architectures seem to have adopted the recursive die() invocation protection. However, the logic there unconditionally calls spin_unlock_irqrestore() (besides also allowing for bust_spinlocks() to be used recursively), instead of undoing just what had been done for the current function invocation. Suggestions? Thanks, Jan
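The first option (counting) can be modelled in user space as follows. This is only a sketch of the suggested semantics, not the kernel implementation; unblank_calls stands in for the unblank_screen()/klogd work:

```c
#include <assert.h>

/* Counting variant of bust_spinlocks() as suggested above (sketch):
 * bust_spinlocks(1) increments oops_in_progress, bust_spinlocks(0)
 * decrements it, and the scheduler-visible work (unblank_screen(),
 * kicking klogd) runs only when the count drops back to zero.  That
 * makes recursive die() -> bust_spinlocks() nesting safe. */
static int oops_in_progress;
static int unblank_calls;	/* stands in for unblank_screen() + klogd */

static void bust_spinlocks(int yes)
{
	if (yes) {
		++oops_in_progress;
	} else if (oops_in_progress && --oops_in_progress == 0) {
		/* only the outermost exit may enter the scheduler */
		++unblank_calls;
	}
}
```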
Re: [another git patch] move USB net drivers to drivers/net
Hi, On May 10 2007 18:12, Jan Engelhardt wrote: >Subject: Re: [another git patch] move USB net drivers to drivers/net > >Hi Jeff, > > >On May 9 2007 21:38, Jeff Garzik wrote: >>diff --git a/drivers/net/Makefile b/drivers/net/Makefile >>index 59c0459..c5d8423 100644 >>--- a/drivers/net/Makefile >>+++ b/drivers/net/Makefile >>@@ -206,6 +206,14 @@ obj-$(CONFIG_TR) += tokenring/ >> obj-$(CONFIG_WAN) += wan/ >> obj-$(CONFIG_ARCNET) += arcnet/ >> obj-$(CONFIG_NET_PCMCIA) += pcmcia/ >>+ >>+obj-$(CONFIG_USB_CATC) += usb/ >>+obj-$(CONFIG_USB_KAWETH)+= usb/ >>+obj-$(CONFIG_USB_PEGASUS) += usb/ >>+obj-$(CONFIG_USB_RTL8150) += usb/ >>+obj-$(CONFIG_USB_USBNET)+= usb/ >>+obj-$(CONFIG_USB_ZD1201)+= usb/ >>+ > >This looks, well, a bit ugly. Since we seem to be going in the direction >of using "menuconfig"s (Kconfig language object) anyway, I propose the >patch below. Please consider, thank you. > >diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig >new file mode 100644 >index 000..3de564b >--- /dev/null >+++ b/drivers/net/usb/Kconfig >@@ -0,0 +1,338 @@ >+# >+# USB Network devices configuration >+# >+comment "Networking support is needed for USB Network Adapter support" >+ depends on USB && !NET > >This comment is superfluous. If !NET, then NETDEVICES is not even available. >Since you just moved/renamed it, you are not to blame. Though, I remove >it in below's patch (applies on top of >commit 5b2fc499917e5897a13add780e181b4cef197072). 
> > >Signed-off-by: Jan Engelhardt <[EMAIL PROTECTED]> > >diff --git a/drivers/net/Makefile b/drivers/net/Makefile >index c5d8423..5f6112d 100644 >--- a/drivers/net/Makefile >+++ b/drivers/net/Makefile >@@ -207,12 +207,7 @@ obj-$(CONFIG_WAN) += wan/ > obj-$(CONFIG_ARCNET) += arcnet/ > obj-$(CONFIG_NET_PCMCIA) += pcmcia/ > >-obj-$(CONFIG_USB_CATC) += usb/ >-obj-$(CONFIG_USB_KAWETH)+= usb/ >-obj-$(CONFIG_USB_PEGASUS) += usb/ >-obj-$(CONFIG_USB_RTL8150) += usb/ >-obj-$(CONFIG_USB_USBNET)+= usb/ >-obj-$(CONFIG_USB_ZD1201)+= usb/ >+obj-$(CONFIG_NETDEV_USB) += usb/ > > obj-y += wireless/ > obj-$(CONFIG_NET_TULIP) += tulip/ >diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig >index 3de564b..2458f99 100644 >--- a/drivers/net/usb/Kconfig >+++ b/drivers/net/usb/Kconfig >@@ -1,11 +1,12 @@ > # > # USB Network devices configuration > # >-comment "Networking support is needed for USB Network Adapter support" >- depends on USB && !NET >- >-menu "USB Network Adapters" >+menuconfig NETDEVICES_USB >+ bool "USB Network Adapters" > depends on USB && NET >+ default y >+ >+if NETDEVICES_USB > > config USB_CATC > tristate "USB CATC NetMate-based Ethernet device support (EXPERIMENTAL)" >@@ -334,5 +335,4 @@ config USB_NET_ZAURUS > really need this non-conformant variant of CDC Ethernet (or in > some cases CDC MDLM) protocol, not "g_ether". > >- >-endmenu >+endif # NETDEVICES_USB ># > > Jan >-- Jan -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] add kdump_after_notifier
On Thu, Aug 16, 2007 at 06:26:35PM +0900, Takenori Nagano wrote: > Vivek Goyal wrote: > > So for the time being I think we can put RAS tools on die notifier list > > and if it runs into issues we can always think of creating a separate list. > > > > Few things come to mind. > > > > - Why there is a separate panic_notifier_list? Can't it be merged with > > die_chain? die_val already got one of the event type as PANIC. If there > > are no specific reasons then we should merge the two lists. Registering > > RAS tools on a single list is easier. > > I think it is difficult, because die_chain is defined by each architecture. > I think die_chain is arch independent definition (kernel/die_notifier.c)? But anyway, to begin with it can be done only for panic_notifier. > > - Modify Kdump to register on die_chain list. > > - Modify Kdb to register on die_chain list. > > - Export all the registered members of die_chain through sysfs along with > > their priorities. Priorities should be modifiable. Most likely one > > shall have to introduce additional field in struct notifier_block. This > > field will be a string as an identifier of the user registerd. e.g > > "Kdump", "Kdb" etc. > > > > Now user will be able to view all the die_chain users through sysfs and > > be able to modify the order in which these should run by modifying their > > priority. Hence all the RAS tools can co-exist. > > This is my image of your proposal. > > - Print current order > > # cat /sys/class/misc/debug/panic_notifier_list > priority name > 1 IPMI > 2 watchdog > 3 Kdb > 4 Kdump > I think Bernhard's suggestion looks better here. I noticed that /sys/kernel/debug is already present. So how about following. /sys/kernel/debug/kdump/priority /sys/kernel/debug/kdb/priority /sys/kernel/debug/IPMI/priority I think at some point of time we shall have to create another file say description. /sys/kernel/debug/IPMI/description Which can tell what does this tool do? 
Otherwise a user might not have any clue how to prioritize various things. Thanks Vivek
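The registration-plus-adjustable-priority scheme being discussed can be modelled like this. All names and fields here are illustrative, not the real struct notifier_block layout; writing a new priority via sysfs would amount to a chain_set_priority() followed by a re-sort:

```c
#include <assert.h>
#include <string.h>

#define MAX_TOOLS 8

/* Sketch of a die-chain entry carrying the proposed identifier
 * string ("Kdump", "Kdb", "IPMI", ...) and a user-modifiable
 * priority.  Callbacks run in ascending priority order. */
struct ras_notifier {
	const char *name;
	int priority;		/* lower value runs first */
	void (*fn)(void);
};

static struct ras_notifier chain[MAX_TOOLS];
static int nr_tools;

static void chain_register(const char *name, int prio, void (*fn)(void))
{
	chain[nr_tools].name = name;
	chain[nr_tools].priority = prio;
	chain[nr_tools].fn = fn;
	nr_tools++;
}

/* simple insertion sort by priority; rerun after a priority write */
static void chain_sort(void)
{
	for (int i = 1; i < nr_tools; i++)
		for (int j = i; j > 0 && chain[j].priority < chain[j - 1].priority; j--) {
			struct ras_notifier tmp = chain[j];
			chain[j] = chain[j - 1];
			chain[j - 1] = tmp;
		}
}

/* what a write to /sys/kernel/debug/<tool>/priority would do */
static void chain_set_priority(const char *name, int prio)
{
	for (int i = 0; i < nr_tools; i++)
		if (strcmp(chain[i].name, name) == 0)
			chain[i].priority = prio;
	chain_sort();
}
```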
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
I wrote: > Nick Piggin wrote: >> You might find that these places that appear to need barriers are >> buggy for other reasons anyway. Can you point to some in-tree code >> we can have a look at? > > I could, or could not, if I were through with auditing the code. I > remembered one case and posted it (nodemgr_host_thread) which was safe > because msleep_interruptible provided the necessary barrier there, and > this implicit barrier is not in danger to be removed by future patches. PS, just in case anybody holds his breath for more example code from me, I don't plan to continue with an actual audit of the drivers I maintain. It's an important issue, but my current time budget will restrict me to look at it ad hoc, per case. (Open bugs have higher priority than potential bugs.) -- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/
Re: [PATCH] rtc: Make rtc-ds1742 driver hotplug-aware
On Aug 17 2007 01:06, Atsushi Nemoto wrote: >Add an MODULE_ALIAS() to make this platform driver hotplug-aware. > >Signed-off-by: Atsushi Nemoto <[EMAIL PROTECTED]> >--- >diff --git a/drivers/rtc/rtc-ds1742.c b/drivers/rtc/rtc-ds1742.c >index b2e5481..4bd22dc 100644 >--- a/drivers/rtc/rtc-ds1742.c >+++ b/drivers/rtc/rtc-ds1742.c >@@ -273,3 +273,4 @@ MODULE_AUTHOR("Atsushi Nemoto <[EMAIL PROTECTED]>"); > MODULE_DESCRIPTION("Dallas DS1742 RTC driver"); > MODULE_LICENSE("GPL"); > MODULE_VERSION(DRV_VERSION); >+MODULE_ALIAS("ds1742"); Why exactly is this needed? What script refers to the module as ds1742 instead of rtc-ds1742? Regular hotplug (e.g. udev/modprobe) also goes by PCI ID or whatever is applicable and loads the module which provides support for said ID. Jan --
Re: [PATCH]: proc: export a processes resource limits via proc/
On Fri, Aug 17, 2007 at 04:09:26AM -0400, [EMAIL PROTECTED] wrote: > On Thu, 16 Aug 2007 08:35:38 EDT, Neil Horman said: > > Hey again- > > Andrew requested that I repost this cleanly, after running the patch > > through checkpatch. As requested here it is with the changelog. > > > > Currently, there exists no method for a process to query the resource > > limits of another process. They can be inferred via some mechanisms but > > they > > cannot be explicitly determined. Given that this information can be > > useful > to > > know during the debugging of an application, I've written this patch which > > exports all of a process's limits via /proc//limits. > > > > Tested successfully by myself on x86 on top of 2.6.23-rc2-mm1. > > I had only one comment the first time around, and Neil addressed it. > > I've also tested on x86_64 23-rc2-mm1, and it works here too. I saw where > this > uses units of 'bytes' while the shell 'ulimit' uses 1024-byte units in some > places, but (a) this lists the units and (b) it's consistent with setrlimit(). > Testing with values >4G shows it's 64-bit clean as well. > > One question: Are the units milliseconds, or seconds here: > > + [RLIMIT_CPU] = {"Max cpu time", "ms"}, > > Other than that, feel free to stick either/both of these on: > > Reviewed-By: Valdis Kletnieks <[EMAIL PROTECTED]> > Tested-By: Valdis Kletnieks <[EMAIL PROTECTED]> Oops, you're right, thanks. New patch Currently, there exists no method for a process to query the resource limits of another process. They can be inferred via some mechanisms but they cannot be explicitly determined. Given that this information can be useful to know during the debugging of an application, I've written this patch which exports all of a process's limits via /proc//limits. 
Thanks Neil Signed-off-by: Neil Horman <[EMAIL PROTECTED]> base.c | 69 + 1 file changed, 69 insertions(+) commit 95e91e3102d7963e63d68969b8db5694b83a6684 Author: Neil Horman <[EMAIL PROTECTED]> Date: Thu Aug 16 08:25:59 2007 -0400 Readding proc_pid_limits patch with checkpatch.pl run diff --git a/fs/proc/base.c b/fs/proc/base.c index ed2b224..4fb34a5 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -74,6 +74,7 @@ #include #include #include +#include #include "internal.h" /* NOTE: @@ -323,6 +324,72 @@ static int proc_oom_score(struct task_struct *task, char *buffer) return sprintf(buffer, "%lu\n", points); } +struct limit_names { + char *name; + char *unit; +}; + +static const struct limit_names lnames[RLIM_NLIMITS] = { + [RLIMIT_CPU] = {"Max cpu time", "seconds"}, + [RLIMIT_FSIZE] = {"Max file size", "bytes"}, + [RLIMIT_DATA] = {"Max data size", "bytes"}, + [RLIMIT_STACK] = {"Max stack size", "bytes"}, + [RLIMIT_CORE] = {"Max core file size", "bytes"}, + [RLIMIT_RSS] = {"Max resident set", "bytes"}, + [RLIMIT_NPROC] = {"Max processes", "processes"}, + [RLIMIT_NOFILE] = {"Max open files", "files"}, + [RLIMIT_MEMLOCK] = {"Max locked memory", "bytes"}, + [RLIMIT_AS] = {"Max address space", "bytes"}, + [RLIMIT_LOCKS] = {"Max file locks", "locks"}, + [RLIMIT_SIGPENDING] = {"Max pending signals", "signals"}, + [RLIMIT_MSGQUEUE] = {"Max msgqueue size", "bytes"}, + [RLIMIT_NICE] = {"Max nice priority", NULL}, + [RLIMIT_RTPRIO] = {"Max realtime priority", NULL}, +}; + +/* Display limits for a process */ +static int proc_pid_limits(struct task_struct *task, char *buffer) +{ + unsigned int i; + int count = 0; + char *bufptr = buffer; + + struct rlimit rlim[RLIM_NLIMITS]; + + read_lock(&tasklist_lock); + memcpy(rlim, task->signal->rlim, sizeof(struct rlimit) * RLIM_NLIMITS); + read_unlock(&tasklist_lock); + + /* +* print the file header +*/ + count += sprintf(&bufptr[count], "%-25s %-20s %-20s %-10s\n", + "Limit", "Soft Limit", "Hard Limit", "Units"); + + for (i = 0; i < 
RLIM_NLIMITS; i++) { + if (rlim[i].rlim_cur == RLIM_INFINITY) + count += sprintf(&bufptr[count], "%-25s %-20s ", +lnames[i].name, "unlimited"); + else + count += sprintf(&bufptr[count], "%-25s %-20lu ", +lnames[i].name, rlim[i].rlim_cur); + + if (rlim[i].rlim_max == RLIM_INFINITY) + count += sprintf(&bufptr[count], "%-20s ", "unlimited"); + else + count += sprintf(&bufptr[count], "%-20lu ", +rlim[i].rlim_max); + + if (lnames[i].unit) + count += sprintf(&bufptr[count], "%-10s\n", +lnames[i].unit); + else +
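The per-limit formatting logic in the patch can be exercised in user space like this. The helper below is an illustrative restatement of the sprintf sequence above (same "%-25s %-20s %-20s %-10s" column layout), with the infinity checks passed in as flags so it compiles outside the kernel:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* User-space restatement of one iteration of the proc_pid_limits()
 * loop above: name, soft limit, hard limit, optional unit, with
 * "unlimited" substituted when a limit is RLIM_INFINITY.  The
 * function name and the *_inf flags are ours, not the kernel's. */
static int format_limit_line(char *buf, const char *name,
			     unsigned long cur, unsigned long max,
			     int cur_inf, int max_inf, const char *unit)
{
	int count = 0;

	if (cur_inf)
		count += sprintf(&buf[count], "%-25s %-20s ", name, "unlimited");
	else
		count += sprintf(&buf[count], "%-25s %-20lu ", name, cur);

	if (max_inf)
		count += sprintf(&buf[count], "%-20s ", "unlimited");
	else
		count += sprintf(&buf[count], "%-20lu ", max);

	if (unit)
		count += sprintf(&buf[count], "%-10s\n", unit);
	else
		count += sprintf(&buf[count], "\n");

	return count;
}
```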
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Nick Piggin wrote: > Stefan Richter wrote: >> For architecture port authors, there is Documentation/atomic_ops.txt. >> Driver authors also can learn something from that document, as it >> indirectly documents the atomic_t and bitops APIs. > > "Semantics and Behavior of Atomic and Bitmask Operations" is > pretty direct :) "Indirect", "pretty direct"... It's subjective. (It is not API documentation; it is an implementation specification.) > Sure, it says that it's for arch maintainers, but there is no > reason why users can't make use of it. > > >> Prompted by this thread, I reread this document, and indeed, the >> sentence "Unlike the above routines, it is required that explicit memory >> barriers are performed before and after [atomic_{inc,dec}_return]" >> indicates that atomic_read (one of the "above routines") is very >> different from all other atomic_t accessors that return values. >> >> This is strange. Why is it that atomic_read stands out that way? IMO > > It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set. Yes, but unlike these, atomic_read returns a value. Without me (the API user) providing extra barriers, that value may become something else whenever someone touches code in the vicinity of the atomic_read. >> this API imbalance is quite unexpected by many people. Wouldn't it be >> beneficial to change the atomic_read API to behave the same like all >> other atomic_t accessors that return values? > > It is very consistent and well defined. Operations which both modify > the data _and_ return something are defined to have full barriers > before and after. You are right, atomic_read is not only different from accessors that don't return values, it is also different from all other accessors that return values (because they all also modify the value). There is just no actual API documentation, which contributes to the issue that some people (or at least one: me) learn a little bit late how special atomic_read is. 
> What do you want to add to the other atomic accessors? Full memory > barriers? Only compiler barriers? It's quite likely that if you think > some barriers will fix bugs, then there are other bugs lurking there > anyway. A lot of different though related issues are discussed in this thread, but I personally am only occupied by one particular thing: What kind of return values do I get from atomic_read. > Just use spinlocks if you're not absolutely clear about potential > races and memory ordering issues -- they're pretty cheap and simple. Probably good advice, like generally if driver guys consider lockless algorithms. >> OK, it is also different from the other accessors that return data in so >> far as it doesn't modify the data. But as driver "author", i.e. user of >> the API, I can't see much use of an atomic_read that can be reordered >> and, more importantly, can be optimized away by the compiler. > > It will return to you an atomic snapshot of the data (loaded from > memory at some point since the last compiler barrier). All you have > to be aware of compiler barriers and the Linux SMP memory ordering > model, which should be a given if you are writing lockless code. OK, that's what I slowly realized during this discussion, and I appreciate the explanations that were given here. >> Sure, now >> that I learned of these properties I can start to audit code and insert >> barriers where I believe they are needed, but this simply means that >> almost all occurrences of atomic_read will get barriers (unless there >> already are implicit but more or less obvious barriers like msleep). > > You might find that these places that appear to need barriers are > buggy for other reasons anyway. Can you point to some in-tree code > we can have a look at? I could, or could not, if I were through with auditing the code. 
I remembered one case and posted it (nodemgr_host_thread) which was safe because msleep_interruptible provided the necessary barrier there, and this implicit barrier is not in danger of being removed by future patches. -- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/
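The two atomic_read flavours being argued over can be stated concretely. This is a user-space sketch, with atomic_t standing in for the kernel type; the plain version models the non-volatile i386/x86-64 definition whose loads the compiler may cache or elide, while the volatile-cast version models the proposed alternative that forces a fresh load on every call:

```c
#include <assert.h>

/* Stand-in for the kernel's atomic_t for illustration. */
typedef struct { int counter; } atomic_t;

/* Plain read: the compiler may hoist this load out of a loop or
 * reuse a previously loaded value. */
static inline int atomic_read_plain(const atomic_t *v)
{
	return v->counter;
}

/* Volatile semantics: the cast forces a real load from memory on
 * every call, so the value cannot be elided by the compiler.  (This
 * is the style of access later standardized by the ACCESS_ONCE /
 * READ_ONCE helpers.) */
static inline int atomic_read_volatile(const atomic_t *v)
{
	return *(volatile const int *)&v->counter;
}
```

Neither variant implies any CPU memory ordering; both are purely about what the compiler is allowed to do with the load.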
Re: nanosleep() accuracy
GolovaSteek wrote: > 2007/8/17, Michal Schmidt <[EMAIL PROTECTED]>: > >> GolovaSteek skrev: >> >>> Hello! >>> I need to use sleep with accurate timing. >>> I use 2.6.21 with the rt-preempt patch, >>> with rt_preempt, dyn_ticks, and local_apic enabled. >>> But >>> >>> req.tv_nsec = 300000; >>> req.tv_sec = 0; >>> nanosleep(&req,NULL) >>> >>> makes a pause of around 310-330 microseconds. >>> >> How do you measure this? >> If you want to have something done every 300 microseconds, you must not >> sleep for 300 microseconds in each iteration, because you'd accumulate >> errors. Use a periodic timer or use the current time to compute how long >> to sleep in each iteration. Take a look how cyclictest does it. >> > > no. I just want my program to go to sleep sometimes and wake up at the correct time. > What does your program do that it has such a strict requirement on the exact length of sleeping? >>> I tried to understand how nanosleep() works, but it does not depend on >>> jiffies or on smp_apic_timer_interrupt. >>> >>> Where can accuracy be lost? >>> And how are processes woken up? >>> >>> >>> GolovaSteek >>> >> Don't forget the process will always have non-zero wakeup latency. It >> takes some time to process an interrupt, wake up the process and schedule >> it to run on the CPU. 10-30 microseconds is not unreasonable. >> > > But can 2 operations be done in 10 microseconds? > And why is there that inconsistency? Why sometimes 10 and sometimes 30? > At which points of the implementation does it happen? > > GolovaSteek > If a jitter of 20 microseconds is unacceptable for your application, don't use PC hardware. Consider using a microcontroller. Michal
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Nick Piggin wrote: > Satyam Sharma wrote: >> And we have driver / subsystem maintainers such as Stefan >> coming up and admitting that often a lot of code that's written to use >> atomic_read() does assume the read will not be elided by the compiler. > > So these are broken on i386 and x86-64? The ieee1394 and firewire subsystems have open, undiagnosed bugs, also on i386 and x86-64. But whether there is any bug because of wrong assumptions about atomic_read among them, I don't know. I don't know which assumptions the authors made, I only know that I wasn't aware of all the properties of atomic_read until now. > Are they definitely safe on SMP and weakly ordered machines with > just a simple compiler barrier there? Because I would not be > surprised if there are a lot of developers who don't really know > what to assume when it comes to memory ordering issues. > > This is not a dig at driver writers: we still have memory ordering > problems in the VM too (and probably most of the subtle bugs in > lockless VM code are memory ordering ones). Let's not make up a > false sense of security and hope that sprinkling volatile around > will allow people to write bug-free lockless code. If a writer > can't be bothered reading API documentation ...or, if there is none, the implementation specification (as in case of the atomic ops), or, if there is none, the implementation (as in case of a some infrastructure code here and there)... > and learning the Linux memory model, they can still be productive > writing safely locked code. Provided they are aware that they might not have the full picture of the lockless primitives. :-) -- Stefan Richter -=-=-=== =--- =---= http://arcgraph.de/sr/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP
Hello, our SLES production system 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 UTC 2006 x86_64 x86_64 x86_64 stopped with the following messages in the log Is this a known bug in this version and is it solved in the next versions. As I wrote it is a production server so I can't change away from reiserfs. Thanks Andreas Moroder Aug 16 22:00:07 falcon1 su: pam_unix2: session started for user oracle, service su Aug 16 22:00:37 falcon1 kernel: Unable to handle kernel NULL pointer dereference at 0001 RIP: Aug 16 22:00:37 falcon1 kernel: {:reiserfs:reiserfs_clear_inode+22} Aug 16 22:00:37 falcon1 kernel: PML4 12c07a067 PGD 103721067 PMD 0 Aug 16 22:00:37 falcon1 kernel: Oops: 0002 [1] SMP Aug 16 22:00:37 falcon1 kernel: CPU 2 Aug 16 22:00:37 falcon1 kernel: Pid: 224, comm: kswapd0 Tainted: P UM (2.6.5-7.252-smp SLES9_SP3_BRANCH-2006021404) Aug 16 22:00:37 falcon1 kernel: RIP: 0010:[] {:reiserfs:reiserfs_clear_inode+22} Aug 16 22:00:37 falcon1 kernel: RSP: :01013fdfdc08 EFLAGS: 00010213 Aug 16 22:00:37 falcon1 kernel: RAX: a00b3d10 RBX: 0100b8fd49f8 RCX: 8053fa80 Aug 16 22:00:37 falcon1 kernel: RDX: 01013fdfdc78 RSI: 0100b8fd4680 RDI: 0001 Aug 16 22:00:37 falcon1 kernel: RBP: 0100b8fd49f8 R08: 01000a75c810 R09: 0001 Aug 16 22:00:37 falcon1 kernel: R10: R11: fb11 R12: 01013fdfdc78 Aug 16 22:00:37 falcon1 kernel: R13: 0031 R14: 803e6b20 R15: 01013fdfdc78 Aug 16 22:00:37 falcon1 kernel: FS: () GS:8057d000() knlGS: Aug 16 22:00:37 falcon1 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b Aug 16 22:00:37 falcon1 kernel: CR2: 0001 CR3: 0a6b1000 CR4: 06e0 Aug 16 22:00:37 falcon1 kernel: Process kswapd0 (pid: 224, threadinfo 01013fdfc000, task 01013fe2aa40) Aug 16 22:00:37 falcon1 kernel: Stack: 0100b8fd49f8 801aa7eb 0100b8fd4a08 801aa898 Aug 16 22:00:37 falcon1 kernel:010005147530 010046a47d08 010046a47cf8 0080 Aug 16 22:00:37 falcon1 kernel:0080 801aab17 Aug 16 22:00:37 falcon1 kernel: Call Trace:{clear_inode+155} {dispose_list+104} Aug 16 22:00:37 falcon1 kernel: 
{shrink_icache_memory+471} {shrink_slab+245} Aug 16 22:00:37 falcon1 kernel: {balance_pgdat+477} {kswapd+390} Aug 16 22:00:37 falcon1 kernel:{do_exit+3505} {autoremove_wake_function+0} Aug 16 22:00:37 falcon1 kernel: {autoremove_wake_function+0} {child_rip+8} Aug 16 22:00:37 falcon1 kernel:{kswapd+0} {child_rip+0} Aug 16 22:00:37 falcon1 kernel: Aug 16 22:00:37 falcon1 kernel: Aug 16 22:00:37 falcon1 kernel: Code: f0 ff 0f 0f 94 c0 84 c0 74 05 e8 0b a4 0b e0 48 8b 7b e0 48 Aug 16 22:00:37 falcon1 kernel: RIP {:reiserfs:reiserfs_clear_inode+22} RSP <01013fdfdc08> Aug 16 22:00:37 falcon1 kernel: CR2: 0001 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP
On Fri, 17 Aug 2007, Andreas Moroder wrote: > our SLES production system 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 > UTC 2006 x86_64 x86_64 x86_64 stopped with the following messages in the > log Is this a known bug in this version and is it solved in the next > versions. Hi, this is a vendor-provided kernel, so this is pretty much offtopic on lkml -- please file a bug report in Novell bugzilla, which is the proper place for this. Thanks, -- Jiri Kosina
Re: [2.6.23-rc3] NFSv4 client oops
On Wed, 2007-08-15 at 18:08 -0400, Jeff Garzik wrote: > While upgrading nfs-utils on my NFSv4 file server (F7 x86-64 > 2.6.23-rc3), I got an oops on the NFSv4 client (FC6 x86-64 2.6.23-rc3), > and communications stopped. > > I rebooted the client, and everything was fine again. Then, on the > server, I removed an extra nfs-utils package left over from FC6, causing > the nfs daemons to be restarted. Another oops. > > So it seems like whatever Fedora does during an nfs-utils upgrade __on > the server__ is causing a remote oops __on the client__. > > NFSv4 client kernel config also attached. > > Never seen this problem before, but then again, never did a remote OS > upgrade of the fileserver while running NFSv4 before. > > Jeff > > > plain text document attachment (log.txt) > Aug 15 15:44:23 core kernel: nfs: server pretzel not responding, timed out > Aug 15 15:44:23 core last message repeated 8 times > Aug 15 15:44:23 core kernel: [ cut here ] > Aug 15 15:44:23 core kernel: kernel BUG at fs/nfs/nfs4xdr.c:1040! 
> Aug 15 15:44:23 core kernel: invalid opcode: [1] SMP > Aug 15 15:44:23 core kernel: CPU 0 > Aug 15 15:44:23 core kernel: Modules linked in: nfs lockd sunrpc af_packet > ipv6 cpufreq_ondemand acpi_cpufreq battery floppy nvram sg snd_hda_intel > ata_generic snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_page_alloc e1000 > firewire_ohci ata_piix i2c_core sr_mod cdrom sata_sil ahci libata sd_mod > scsi_mod ext3 jbd ehci_hcd uhci_hcd > Aug 15 15:44:23 core kernel: Pid: 16353, comm: 10.10.10.1-recl Not tainted > 2.6.23-rc3 #1 > Aug 15 15:44:23 core kernel: RIP: 0010:[] > [] :nfs:encode_open+0x1c0/0x330 > Aug 15 15:44:23 core kernel: RSP: 0018:8100467c5c60 EFLAGS: 00010202 > Aug 15 15:44:23 core kernel: RAX: 81000f89b8b8 RBX: 697a6f6d RCX: > 81000f89b8b8 > Aug 15 15:44:23 core kernel: RDX: 0004 RSI: 0004 RDI: > 8100467c5c80 > Aug 15 15:44:23 core kernel: RBP: 8100467c5c80 R08: 81000f89bc30 R09: > 81000f89b83f > Aug 15 15:44:23 core kernel: R10: 0001 R11: 881e79e0 R12: > 81003cbd1808 > Aug 15 15:44:23 core kernel: R13: 81000f89b860 R14: 81005fc984e0 R15: > 88240af0 > Aug 15 15:44:23 core kernel: FS: () > GS:8052a000() knlGS: > Aug 15 15:44:23 core kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b > Aug 15 15:44:23 core kernel: CR2: 2adb9e51a030 CR3: 7ea7e000 CR4: > 06e0 > Aug 15 15:44:23 core kernel: DR0: DR1: DR2: > > Aug 15 15:44:23 core kernel: DR3: DR6: 0ff0 DR7: > 0400 > Aug 15 15:44:23 core kernel: Process 10.10.10.1-recl (pid: 16353, threadinfo > 8100467c4000, task 8100038ce780) > Aug 15 15:44:23 core kernel: Stack: 81004aeb6a40 81003cbd1808 > 81003cbd1808 88240b5d > Aug 15 15:44:23 core kernel: 81000f89b8bc 81005fc984e8 > 81000f89bc30 81005fc984e8 > Aug 15 15:44:23 core kernel: 0003 > 81003cbd1800 > Aug 15 15:44:23 core kernel: Call Trace: > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_xdr_enc_open_noattr+0x6d/0x90 > Aug 15 15:44:23 core kernel: [] > :sunrpc:rpcauth_wrap_req+0x97/0xf0 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_xdr_enc_open_noattr+0x0/0x90 > Aug 
15 15:44:23 core kernel: [] > :sunrpc:call_transmit+0x18a/0x290 > Aug 15 15:44:23 core kernel: [] > :sunrpc:__rpc_execute+0x6b/0x290 > Aug 15 15:44:23 core kernel: [] > :sunrpc:rpc_do_run_task+0x76/0xd0 > Aug 15 15:44:23 core kernel: [] > :nfs:_nfs4_proc_open+0x76/0x230 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_open_recover_helper+0x5e/0xc0 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_open_recover+0xe4/0x120 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_open_reclaim+0xa4/0xf0 > Aug 15 15:44:23 core kernel: [] > :nfs:nfs4_reclaim_open_state+0x55/0x1b0 > Aug 15 15:44:23 core kernel: [] :nfs:reclaimer+0x2ca/0x390 > Aug 15 15:44:23 core kernel: [] :nfs:reclaimer+0x0/0x390 > Aug 15 15:44:23 core kernel: [] kthread+0x4b/0x80 > Aug 15 15:44:23 core kernel: [] child_rip+0xa/0x12 > Aug 15 15:44:23 core kernel: [] kthread+0x0/0x80 > Aug 15 15:44:23 core kernel: [] child_rip+0x0/0x12 > Aug 15 15:44:23 core kernel: > Aug 15 15:44:23 core kernel: > Aug 15 15:44:23 core kernel: Code: 0f 0b eb fe 48 89 ef c7 00 00 00 00 02 be > 08 00 00 00 e8 79 > Aug 15 15:44:23 core kernel: RIP [] > :nfs:encode_open+0x1c0/0x330 > Aug 15 15:44:23 core kernel: RSP > Aug 15 15:44:23 core kernel: nfs: server pretzel OK > Aug 15 15:45:02 core shutdown[16390]: shutting down for system reboot > > [...] OK. That probably indicates an error in the calculation of NFS4_enc_open_noattr_sz (it that the resulting value is too small). I'll have a look this weekend... Cheers Trond - To unsubscribe from thi
Re: [PATCH/RFC 4/4, second shot]KVM uses "account_guest_time()"
KVM updates vtime in task_struct to allow account_guest_time() to modify user, system and guest time in cpustat accordingly.

Index: kvm/drivers/kvm/Kconfig
===
--- kvm.orig/drivers/kvm/Kconfig	2007-08-17 10:24:46.0 +0200
+++ kvm/drivers/kvm/Kconfig	2007-08-17 10:25:25.0 +0200
@@ -41,4 +41,10 @@
 	  Provides support for KVM on AMD processors equipped with the AMD-V
 	  (SVM) extensions.
 
+config GUEST_ACCOUNTING
+	bool "Virtual Machine accounting support"
+	depends on KVM
+	---help---
+	  Allows to account CPU time used by the Virtual Machines.
+
 endif # VIRTUALIZATION

Index: kvm/drivers/kvm/kvm.h
===
--- kvm.orig/drivers/kvm/kvm.h	2007-08-17 10:19:30.0 +0200
+++ kvm/drivers/kvm/kvm.h	2007-08-17 10:21:56.0 +0200
@@ -589,6 +589,19 @@
 int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run);
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+static inline ktime_t kvm_guest_enter(void)
+{
+	return ktime_get();
+}
+
+static inline void kvm_guest_exit(ktime_t enter)
+{
+	ktime_t delta = ktime_sub(ktime_get(), enter);
+	current->vtime = ktime_add(current->vtime, delta);
+}
+#endif
+
 static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code)
 {

Index: kvm/drivers/kvm/svm.c
===
--- kvm.orig/drivers/kvm/svm.c	2007-08-17 10:22:07.0 +0200
+++ kvm/drivers/kvm/svm.c	2007-08-17 10:23:32.0 +0200
@@ -1392,6 +1392,9 @@
 	u16 gs_selector;
 	u16 ldt_selector;
 	int r;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	ktime_t now;
+#endif
 
 again:
 	r = kvm_mmu_reload(vcpu);
@@ -1404,6 +1407,9 @@
 	clgi();
 
 	vcpu->guest_mode = 1;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	now = kvm_guest_enter();
+#endif
 	if (vcpu->requests)
 		if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
 			svm_flush_tlb(vcpu);
@@ -1536,6 +1542,9 @@
 #endif
 		: "cc", "memory" );
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+	kvm_guest_exit(now);
+#endif
 	vcpu->guest_mode = 0;
 
 	if (vcpu->fpu_active) {

Index: kvm/drivers/kvm/vmx.c
===
--- kvm.orig/drivers/kvm/vmx.c	2007-08-17 10:23:36.0 +0200
+++ kvm/drivers/kvm/vmx.c	2007-08-17 10:24:37.0 +0200
@@ -2052,6 +2052,9 @@
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u8 fail;
 	int r;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	ktime_t now;
+#endif
 
 preempted:
 	if (vcpu->guest_debug.enabled)
@@ -2078,6 +2081,9 @@
 	local_irq_disable();
 
 	vcpu->guest_mode = 1;
+#ifdef CONFIG_GUEST_ACCOUNTING
+	now = kvm_guest_enter();
+#endif
 	if (vcpu->requests)
 		if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
 			vmx_flush_tlb(vcpu);
@@ -2198,6 +2204,9 @@
 	      [cr2]"i"(offsetof(struct kvm_vcpu, cr2))
 	      : "cc", "memory" );
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+	kvm_guest_exit(now);
+#endif
 	vcpu->guest_mode = 0;
 	local_irq_enable();
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:
On Fri, 17 Aug 2007, Nick Piggin wrote:
Satyam Sharma wrote:

It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier.

Probably you didn't mean that, but no, schedule() is not a barrier because it sleeps. It's a barrier because it's invisible.

Where did I say it is a barrier because it sleeps? It is always a barrier because, at the lowest level, schedule() (and thus anything that sleeps) is defined to always be a barrier. Regardless of whatever obscure means the compiler might need to infer the barrier. In other words, you can ignore those obscure details because schedule() is always going to have an explicit barrier in it. The "unobvious" thing is that you wanted to know how the compiler knows a function is a barrier -- answer is that if it does not *know* it is not a barrier, it must assume it is a barrier.

True, that's clearly what happens here. But you're definitely joking that this is "obvious" in terms of code-clarity, right?

No. If you accept that barrier() is implemented correctly, and you know that sleeping is defined to be a barrier, then it's perfectly clear. You don't have to know how the compiler "knows" that some function contains a barrier.

Just 5 minutes back you mentioned elsewhere you like seeing lots of explicit calls to barrier() (with comments, no less, hmm? :-)

Sure, but there are well known primitives which contain barriers, and trivial recognisable code sequences for which you don't need comments. waiting-loops using sleeps or cpu_relax() are prime examples.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] NOMMU: Separate out VMAs
David Howells wrote: David Howells <[EMAIL PROTECTED]> wrote: Yes. I found the major leak this morning. There may be a minor leak, but I'm not convinced it's in the mmap stuff. See revised patch. Oops. That was the old patch. Try this one instead. Here are some changes to make it more stable on my system. In do_mmap_private, I've commented out the logic to free excess pages, as it fragments terribly and causes a simple while true; do cat /proc/buddyinfo; done loop to go oom. Also, I think you're freeing high-order pages unaligned to their order? In shrink_vma, we must save the mm across calls to remove_vma_from_mm (oops when telnetting into the box). In do_munmap, we can deal with freeing more than one vma. I've not touched the rb-tree logic in the shared file case, as I have no idea what it's trying to do given that only exact matches are allowed. It still does not survive my mmap stress-tester, so I'll keep looking. Why do we need vm_regions for anonymous memory? Wouldn't it be enough to just have a VMA? Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. 
Martin, Margaret Seif

diff --width=180 -x '*~' -x '.*' -x 'Sys*' -dru linux-2.6.x-daves-baseline/mm/nommu.c linux-2.6.x/mm/nommu.c
--- linux-2.6.x-daves-baseline/mm/nommu.c	2007-08-14 21:20:43.0 +0200
+++ linux-2.6.x/mm/nommu.c	2007-08-16 19:11:29.0 +0200
@@ -922,19 +931,23 @@
 	total = 1 << order;
 	atomic_add(total, &mmap_pages_allocated);
 
-	point = len >> PAGE_SHIFT;
-	while (point < total) {
-		order = ilog2(total - point);
-		_debug("shave %u/%lu", 1 << order, total - point);
-		atomic_sub(1 << order, &mmap_pages_allocated);
-		__free_pages(pages + point, order);
-		point += 1 << order;
+	len = PAGE_SIZE << order;
+#if 0
+	while (total > (len >> PAGE_SHIFT)) {
+		unsigned long to_free;
+		order = ilog2(total - (len >> PAGE_SHIFT));
+		to_free = 1 << order;
+		_debug("shave %u/%lu", to_free, total - (len >> PAGE_SHIFT));
+		atomic_sub(to_free, &mmap_pages_allocated);
+		__free_pages(pages + total - to_free, order);
+		total -= to_free;
 	}
 
 	total = len >> PAGE_SHIFT;
 	for (point = 1; point < total; point++)
 		set_page_refcounted(&pages[point]);
-
+#endif
+	split_page(pages, order);
 	base = page_address(pages);
 	region->vm_start = vma->vm_start = (unsigned long) base;
 	region->vm_end = vma->vm_end = vma->vm_start + len;
@@ -1294,6 +1307,7 @@
 			unsigned long from, unsigned long to)
 {
 	struct vm_region *region;
+	struct mm_struct *mm = vma->vm_mm;
 
 	_enter("");
 
@@ -1304,7 +1318,7 @@
 		vma->vm_end = from;
 	else
 		vma->vm_start = to;
-	add_vma_to_mm(vma->vm_mm, vma);
+	add_vma_to_mm(mm, vma);
 
 	/* cut the region down to size */
 	region = vma->vm_region;
@@ -1337,44 +1351,59 @@
 	_enter(",%lx,%zx", start, len);
 
-	/* find the first potentially overlapping VMA */
-	vma = find_vma(mm, start);
-	if (!vma) {
-		_leave(" = -EINVAL [no]");
-		return -EINVAL;
-	}
-
-	/* we're allowed to split an anonymous VMA but not a file-backed one */
-	if (vma->vm_file) {
-		do {
-			if (start > vma->vm_start) {
-				_leave(" = -EINVAL [miss]");
-				return -EINVAL;
-			}
-			if (end == vma->vm_end)
-				goto erase_whole_vma;
-			rb = rb_next(&vma->vm_rb);
-			vma = rb_entry(rb, struct vm_area_struct, vm_rb);
-		} while (rb);
-		_leave(" = -EINVAL [split file]");
-		return -EINVAL;
-	} else {
-		/* the region must be a subset of the VMA found */
-		if (start == vma->vm_start && end == vma->vm_end)
-			goto erase_whole_vma;
-		if (start < vma->vm_start || end > vma->vm_end) {
+	while (start < end) {
+		unsigned long this_end;
+
+		/* find the first potentially overlapping VMA */
+		vma = find_vma(mm, start);
+		if (!vma) {
+			_leave(" = -EINVAL [no]");
+			return -EINVAL;
+		}
+		if (start < vma->vm_start) {
 			_leave(" = -EINVAL [superset]");
 			return -EINVAL;
 		}
-		if (start != vma->vm_start && end != vma->vm_end
[PATCH/RFC 3/4, second shot]Introduce "account_guest_time"
This is another way to compute guest time... I remove the "account modifiers" mechanism and call account_guest_time() directly from account_system_time(). account_system_time() computes user, system and guest times according to the value accumulated in vtime (a ktime_t) in task_struct by the virtual machine.

--
- [EMAIL PROTECTED]
--
"Software is hard" - Donald Knuth

Index: kvm/include/linux/sched.h
===
--- kvm.orig/include/linux/sched.h	2007-08-17 10:18:53.0 +0200
+++ kvm/include/linux/sched.h	2007-08-17 12:33:22.0 +0200
@@ -1192,6 +1192,9 @@
 #ifdef CONFIG_FAULT_INJECTION
 	int make_it_fail;
 #endif
+#ifdef CONFIG_GUEST_ACCOUNTING
+	ktime_t vtime;
+#endif
 };
 
 /*

Index: kvm/kernel/sched.c
===
--- kvm.orig/kernel/sched.c	2007-08-17 10:18:53.0 +0200
+++ kvm/kernel/sched.c	2007-08-17 12:33:07.0 +0200
@@ -3233,6 +3233,37 @@
 	cpustat->user = cputime64_add(cpustat->user, tmp);
 }
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+/*
+ * Account guest time to a process
+ * @p: the process that the cpu time gets accounted to
+ * @cputime: the cpu time spent in kernel space since the last update
+ */
+static cputime_t account_guest_time(struct task_struct *p, cputime_t cputime)
+{
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	ktime_t kmsec = ktime_set(0, NSEC_PER_MSEC);
+	cputime_t cmsec = msecs_to_cputime(1);
+
+	while ((ktime_to_ns(p->vtime) >= NSEC_PER_MSEC) &&
+	       (cputime_to_msecs(cputime) >= 1)) {
+		p->vtime = ktime_sub(p->vtime, kmsec);
+		p->utime = cputime_add(p->utime, cmsec);
+		p->gtime = cputime_add(p->gtime, cmsec);
+
+		cpustat->guest = cputime64_add(cpustat->guest,
+					       cputime_to_cputime64(cmsec));
+		cpustat->user = cputime64_add(cpustat->user,
+					      cputime_to_cputime64(cmsec));
+
+		cputime = cputime_sub(cputime, cmsec);
+	}
+
+	return cputime;
+}
+#endif
+
 /*
  * Account system cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3246,6 +3277,10 @@
 	struct rq *rq = this_rq();
 	cputime64_t tmp;
 
+#ifdef CONFIG_GUEST_ACCOUNTING
+	cputime = account_guest_time(p, cputime);
+#endif
+
 	p->stime = cputime_add(p->stime, cputime);
 
 	/* Add system time to cpustat. */
Re: [PATCH] Make checkpatch rant about trailing ; at the end of "if" expr
On Aug 16 2007 10:21, Andy Whitcroft wrote:
>> +		if ($line =~ /\bif\s*\([^\)]*\)\s*\;/) {
>
>Heh, you are the second person to suggest this check today, do I detect
>some ripped out hair due to one of these!
>
>I've taken this idea and expanded it to cover if, for and while which
>can all suffer from this. Using the relative indent to work out which
>are valid combinations:

But. The above regex does not seem to handle

	if ((a = b));
		oops;

I have tried to come up with a superduper regex that handles multiple (), but my regex fu seems to stop above two pairs of ().

#!/usr/bin/perl
@check = (
	q"if ((ptr = malloc(bong, GFP)) == NULL) ; (oopsie) ;",
	q"if ((ptr = malloc(bong, GFP)) == NULL)(alright);",
	q"if ( ({ bool evil = (true); evil; }) ) ; (oopsie) ;",
	q"if ( ({ bool evil = (true); evil; }) ) (alright);",
	q"if((()));",
);
my $r = qr/\s*\(\s*(??{$r})?\s*\)\s*|\s*\(\s*[^()]+\s*\)\s*/;
foreach (@check) {
	if ($_ =~ /(if|for|while)$r;/) {
		print "ok $_\n";
	}
}

Jan
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote:
On Fri, 17 Aug 2007, Nick Piggin wrote:

I think they would both be equally ugly,

You think both these are equivalent in terms of "looks":

	while (!atomic_read(&v)) {		|	while (!atomic_read_xxx(&v)) {
		...				|		...
		cpu_relax_no_barrier();		|		cpu_relax_no_barrier();
		order_atomic(&v);		|	}
	}					|

(where order_atomic() is an atomic_t specific wrapper as you mentioned below) ?

I think the LHS is better if your atomic_read_xxx primitive is using the crazy one-sided barrier, because with the LHS code you immediately know what barriers are happening, and with the RHS you have to look at the atomic_read_xxx definition. If your atomic_read_xxx implementation was more intuitive, then both are pretty well equal. More lines != ugly code.

but the atomic_read_volatile variant would be more prone to subtle bugs because of the weird implementation.

What bugs?

You can't think for yourself? Your atomic_read_volatile contains a compiler barrier to the atomic variable before the load. 2 such reads from different locations look like this:

	asm volatile("" : "+m" (v1));
	atomic_read(&v1);
	asm volatile("" : "+m" (v2));
	atomic_read(&v2);

Which implies that the load of v1 can be reordered to occur after the load of v2. Bet you didn't expect that?

Secondly, what sort of code would do such a thing?

See the nodemgr_host_thread() that does something similar, though not exactly the same.

I'm sorry, all this waffling about made up code which might do this and that is just a waste of time. Seriously, the thread is bloated enough and never going to get anywhere with all this handwaving. If someone is saving up all the really real and actually good arguments for why we must have a volatile here, now is the time to use them.

and have barriers both before and after the memory operation,

How could that lead to bugs? (if you can point to existing code, but just some testcase / sample code would be fine as well).

See above. As I said, barrier() is too heavy-handed.

Typo.
I meant: defined for a single memory location (ie. order(x)).
Re: RFHelp: Splitting MAINTAINERS into maintainers/* and Makefile/Kconfig support
On Fri, 2007-08-17 at 02:25 -0400, Chris Snook wrote:
> The whole point of MAINTAINERS is to have one central repository for this
> information, instead of scattering it throughout the various source files. If
> that file is getting too unwieldy (and I don't think it is) then I could
> understand splitting it up hierarchically, for example having a
> drivers/net/MAINTAINERS that listed the info for all the net drivers.

The individual MAINTAINERS files eliminate what Linus described as their "hotness". No shared updates by multiple parties.

> What you're suggesting is a less efficient equivalent to putting the info
> directly into the source files.

I believe that's wrong. Maintainer patterns frequently look like:

F: arch/i386/kernel/cpu/cpufreq/
F: drivers/cpufreq/
F: include/linux/cpufreq.h

If in source, this would currently require 21 + 12 + 1 modifications instead of 1.

> If that approach was enough to make people
> happy, we wouldn't have MAINTAINERS to begin with.

I think the insertion of maintainers into source itself is wrong. It's freeform, error prone and requires significant modifications to source files as maintainers come and go.

> Perhaps with a little automation it could be revived,

Which is the help I'm looking for. Can someone please help here on ideas or implementation adding a makefile target for MAINTAINERS from files in a specific subdirectory?

> though I think that adding a path pattern
> removes the need, while keeping it easier to parse by scripts.
> I appreciate the effort to make MAINTAINERS more useful,
> but please don't add another 600 files to the tree.

In the distributed form, you'll still end up with ~400 new files spread all over the tree. Either way, you'll have hundreds of files.
$ grep "^F:" MAINTAINERS | sed -e "s/[A-Za-z0-9\_\*\.\-]*$//" | sort | uniq | wc -l
415

Centralized as maintainers/* or distributed as ../../../Maintainers. Pick one, I don't much care, but I'm still looking for Makefile/KConfig help reassembling it into a single MAINTAINERS block similar to the current form. Help?

cheers, Joe
Re: Corrupted filesystem with new Firewire stack
> "Stefan" == Stefan Richter <[EMAIL PROTECTED]> writes:

Stefan> There were some similar reports involving that "status write
Stefan> for unknown orb". I haven't found a way to reproduce it; I
Stefan> noticed it only once in the logs here so far.

I get those all the time. Just do heavy ext3 I/O to the drive. Happens here on both a G4 and an intel Mini. Both running FC7.

Aug 17 08:24:08 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:25:08 mini kernel: firewire_sbp2: sbp2_scsi_abort
Aug 17 08:26:36 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:27:36 mini kernel: firewire_sbp2: sbp2_scsi_abort
Aug 17 08:33:51 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:34:51 mini kernel: firewire_sbp2: sbp2_scsi_abort

Lacie drive in both cases.

--
Martin K. Petersen	http://mkp.net/
Re: [PATCH 000 of 6] A few block-layer tidy-up patches.
On Fri, Aug 17 2007, Neil Brown wrote:
> On Friday August 17, [EMAIL PROTECTED] wrote:
> > > > Please inspect the #block-2.6.24 branch to see the result.
> > >
> > > I don't know where to look for this. I checked
> > > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git
> > > but they don't seem to be there.
> > > ??
> >
> > That's where it is, but the kernel.org mirroring is just horrible slow.
> > Just checked now and it's there.
>
> I discovered I was looking in the wrong place - not being very
> familiar with git terminology.
> I found them and it looks right.

Ah, good.

> I had a bit of a look at removing bio_data and ->buffer ... the
> usages outside of drivers/ide are fairly easy to deal with - I might
> send a patch for that. The drivers/ide stuff looks like a fairly
> substantial rewrite is in order.
> e.g. idefloppy_packet_command_s seems to duplicate a lot of
> fields from 'struct request', and it should probably use the request
> struct directly.

We can do it bits at a time, the trivial ones first. I have some experience with drivers/ide/ myself, so I can attempt to tackle some of that myself.

> But a number of times ->buffer points to ->cmd, and there is no bio.
> I guess we should use bio_map_kern to make a bio?

It points to ->cmd?! But yes, generally things just need to be converted to map the data to a bio with bio_map_kern() (or bio_map_user(), where appropriate).

> I'll see if I can come up with something testing it might be
> awkward. I have an ide cdrom I can test on. Maybe an ide disk, but
> not an ide floppy :-)

I can help with that as well :-)

--
Jens Axboe
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Because they should be thinking about them in terms of barriers, over which the compiler / CPU is not to reorder accesses or cache memory operations, rather than "special" "volatile" accesses. This is obviously just a taste thing. Whether to have that forget(x) barrier as something author should explicitly sprinkle appropriately in appropriate places in the code by himself or use a primitive that includes it itself. That's not obviously just taste to me. Not when the primitive has many (perhaps, the majority) of uses that do not require said barriers. And this is not solely about the code generation (which, as Paul says, is relatively minor even on x86). I prefer people to think explicitly about barriers in their lockless code. I'm not saying "taste matters aren't important" (they are), but I'm really skeptical if most folks would find the former tasteful. So I /do/ have better taste than most folks? Thanks! :-) And by the way, the point is *also* about the fact that cpu_relax(), as of today, implies a full memory clobber, which is not what a lot of such loops want. (due to stuff mentioned elsewhere, summarized in that summary) That's not the point, That's definitely the point, why not. This is why "barrier()", being heavy-handed, is not the best option. That is _not_ the point (of why a volatile atomic_read is good) because there has already been an alternative posted that better conforms with Linux barrier API and is much more widely useful and more usable. If you are so worried about barrier() being too heavyweight, then you're off to a poor start by wanting to add a few K of kernel text by making atomic_read volatile. because as I also mentioned, the logical extention to Linux's barrier API to handle this is the order(x) macro. Again, not special volatile accessors. Sure, that forget(x) macro _is_ proposed to be made part of the generic API. 
Doesn't explain why not to define/use primitives that have volatility semantics in themselves, though (taste matters apart).

If you follow the discussion... You were thinking of a reason why the semantics *should* be changed or added, and I was rebutting your argument that it must be used when a full barrier() is too heavy (ie. by pointing out that order() has superior semantics anyway). Why do I keep repeating the same things? I'll not continue bloating this thread until a new valid point comes up...
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > Satyam Sharma wrote: > > > > > > It is very obvious. msleep calls schedule() (ie. sleeps), which is > > > always a barrier. > > > > Probably you didn't mean that, but no, schedule() is not barrier because > > it sleeps. It's a barrier because it's invisible. > > Where did I say it is a barrier because it sleeps? Just below. What you wrote: > It is always a barrier because, at the lowest level, schedule() (and thus > anything that sleeps) is defined to always be a barrier. "It is always a barrier because, at the lowest level, anything that sleeps is defined to always be a barrier". > Regardless of > whatever obscure means the compiler might need to infer the barrier. > > In other words, you can ignore those obscure details because schedule() is > always going to have an explicit barrier in it. I didn't quite understand what you said here, so I'll tell what I think: * foo() is a compiler barrier if the definition of foo() is invisible to the compiler at a callsite. * foo() is also a compiler barrier if the definition of foo() includes a barrier, and it is inlined at the callsite. If the above is wrong, or if there's something else at play as well, do let me know. > > > The "unobvious" thing is that you wanted to know how the compiler knows > > > a function is a barrier -- answer is that if it does not *know* it is not > > > a barrier, it must assume it is a barrier. > > > > True, that's clearly what happens here. But are you're definitely joking > > that this is "obvious" in terms of code-clarity, right? > > No. If you accept that barrier() is implemented correctly, and you know > that sleeping is defined to be a barrier, Curiously, that's the second time you've said "sleeping is defined to be a (compiler) barrier". How does the compiler even know if foo() is a function that "sleeps"? 
Do compilers have some notion of "sleeping" to ensure they automatically assume a compiler barrier whenever such a function is called? Or are you saying that the compiler can see the barrier() inside said function ... nopes, you're saying quite the opposite below.

> then its perfectly clear. You
> don't have to know how the compiler "knows" that some function contains
> a barrier.

I think I do, why not? Would appreciate if you could elaborate on this.

Satyam
Marvell 88E8056 gigabit ethernet controller
Hi all,

I've read where the onboard Marvell lan controller on some Gigabyte boards don't work. I've got two systems using the same Gigabyte board; on one the LAN works, on the other it dies like described by others. Here's the systems:

Working system:
Gigabyte 965P-DS3 rev 3.3 (BIOS F10)
Core2 Q6600
2GB Corsair XMS2 memory
kernel 2.6.22.3
lspci for LAN controller:
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 14)

Broken system:
Gigabyte 965P-DS3 rev 3.3 (BIOS F10)
Core2 E4400
2GB Corsair XMS2 memory
kernel 2.6.22.3
lspci for LAN controller:
03:00.0 Ethernet controller: Marvell Technology Group Ltd. Unknown device 4364 (rev 12)

The BIOS for the two systems are setup the same and the config for the kernels are the same too. I've actually tried taking the kernel from the working system and booting it on the broken one but still the LAN dies after a couple of seconds. The working system has one card plugged in (nvidia based PCI-X video card); I've taken that card and plugged it into the broken system, booted the same kernel, and it still dies after a while.

I will gladly provide any info needed if it can help in getting this chipset working on the Gigabyte boards.

Thanks,
Kevin
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > [...] > > You think both these are equivalent in terms of "looks": > > > > | > > while (!atomic_read(&v)) { | while (!atomic_read_xxx(&v)) { > > ... | ... > > cpu_relax_no_barrier(); | > > cpu_relax_no_barrier(); > > order_atomic(&v); | } > > } | > > > > (where order_atomic() is an atomic_t > > specific wrapper as you mentioned below) > > > > ? > > I think the LHS is better if your atomic_read_xxx primitive is using the > crazy one-sided barrier, ^ I'd say it's purposefully one-sided. > because the LHS code you immediately know what > barriers are happening, and with the RHS you have to look at the > atomic_read_xxx > definition. No. As I said, the _xxx (whatever the heck you want to name it as) should give the same heads-up that your "order_atomic" thing is supposed to give. > If your atomic_read_xxx implementation was more intuitive, then both are > pretty well equal. More lines != ugly code. > > > [...] > > What bugs? > > You can't think for yourself? Your atomic_read_volatile contains a compiler > barrier to the atomic variable before the load. 2 such reads from different > locations look like this: > > asm volatile("" : "+m" (v1)); > atomic_read(&v1); > asm volatile("" : "+m" (v2)); > atomic_read(&v2); > > Which implies that the load of v1 can be reordered to occur after the load > of v2. And how would that be a bug? (sorry, I really can't think for myself) > > > Secondly, what sort of code would do such a thing? > > > > See the nodemgr_host_thread() that does something similar, though not > > exactly same. > > I'm sorry, all this waffling about made up code which might do this and > that is just a waste of time. First, you could try looking at the code. And by the way, as I've already said (why do *require* people to have to repeat things to you?) this isn't even about only existing code. 
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Laurent Vivier wrote: >> >>> - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate >>> guest time (by calling something like guest_enter() and guest_exit() from >>> the >>> virtualization engine), and when in account_system_time() we have cputime > >>> vtime we substrate vtime from cputime and add vtime to user time and guest >>> time. >>> But doing like this we freeze in kernel/sched.c the link between system >>> time, >>> user time and guest time (i.e. system time = system time - vtime, user time >>> = >>> user time + vtime and guest time = guest time + vtime). >>> >> Actually, I think we can set a per-cpu "in_guest" flag for the scheduler >> code, which then knows to add the tick to the guest time. That seems >> the simplest possible solution. >> >> lguest or kvm would set the flag before running the guest (which is done >> with preempt disabled or using preemption hooks), and reset it >> afterwards. >> >> Thoughts? >> > > It was my first attempt (except I didn't have a per-cpu flag, but a per-task > flag), it's not visible but I love simplicity... ;-) > > A KVM VCPU is stopped by preemption, so when we enter in scheduler we have > exited from VCPU and thus this flags is off (so we account 0 to the guest). > What > I did then is "set the flag on when we enter in the VCPU, and > "account_system_time()" sets the flag off when it adds this timeslice to > cpustat > (and compute correctly guest, user, system time). But I didn't like this idea > because all code executed after we entered in the VCPU is accounted to the > guest > until we have an account_system_time() and I suppose we can have real system > time in this part. And I guess a VCPU can be less than 1 ms (unit of cputime) > in > a timeslice. > > So ? What's best ? > The normal user/system accounting has the same issue, no? Whereever we happen to land (kernel or user) gets the whole tick. So I think it is okay to have the same limitation for guest time. 
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier. Probably you didn't mean that, but no, schedule() is not barrier because it sleeps. It's a barrier because it's invisible. Where did I say it is a barrier because it sleeps? Just below. What you wrote: It is always a barrier because, at the lowest level, schedule() (and thus anything that sleeps) is defined to always be a barrier. "It is always a barrier because, at the lowest level, anything that sleeps is defined to always be a barrier". ... because it must call schedule and schedule is a barrier. Regardless of whatever obscure means the compiler might need to infer the barrier. In other words, you can ignore those obscure details because schedule() is always going to have an explicit barrier in it. I didn't quite understand what you said here, so I'll tell what I think: * foo() is a compiler barrier if the definition of foo() is invisible to the compiler at a callsite. * foo() is also a compiler barrier if the definition of foo() includes a barrier, and it is inlined at the callsite. If the above is wrong, or if there's something else at play as well, do let me know. Right. The "unobvious" thing is that you wanted to know how the compiler knows a function is a barrier -- answer is that if it does not *know* it is not a barrier, it must assume it is a barrier. True, that's clearly what happens here. But are you're definitely joking that this is "obvious" in terms of code-clarity, right? No. If you accept that barrier() is implemented correctly, and you know that sleeping is defined to be a barrier, Curiously, that's the second time you've said "sleeping is defined to be a (compiler) barrier". _In Linux,_ sleeping is defined to be a compiler barrier. How does the compiler even know if foo() is a function that "sleeps"? 
Do compilers have some notion of "sleeping" to ensure they automatically assume a compiler barrier whenever such a function is called? Or are you saying that the compiler can see the barrier() inside said function ... nopes, you're saying quite the opposite below. You're getting too worried about the compiler implementation. Start by assuming that it does work ;) then its perfectly clear. You don't have to know how the compiler "knows" that some function contains a barrier. I think I do, why not? Would appreciate if you could elaborate on this. If a function is not completely visible to the compiler (so it can't determine whether a barrier could be in it or not), then it must always assume it will contain a barrier so it always does the right thing.
Re: [kvm-devel] [PATCH/RFC 4/4, second shot]KVM uses "account_guest_time()"
Laurent Vivier wrote: > KVM updates vtime in task_struct to allow account_guest_time() to modify user, > system and guest time in cpustat accordingly. > > --- kvm.orig/drivers/kvm/Kconfig 2007-08-17 10:24:46.0 +0200 > +++ kvm/drivers/kvm/Kconfig 2007-08-17 10:25:25.0 +0200 > @@ -41,4 +41,10 @@ > Provides support for KVM on AMD processors equipped with the AMD-V > (SVM) extensions. > > +config GUEST_ACCOUNTING > + bool "Virtual Machine accounting support" > + depends on KVM > + ---help--- > + Allows to account CPU time used by the Virtual Machines. > + Other way round. In the patch that adds account_guest_time(), have a CONFIG_GUEST_ACCOUNTING (defaulting to n, with no description, help, or dependencies). Then, CONFIG_KVM can select GUEST_ACCOUNTING. The advantages of this are: - the puppyvisor can also select this if it so wishes - we don't have core code reference some obscure module CONFIG_PREEMPT_NOTIFIERS does the same thing. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Avi Kivity wrote: > Laurent Vivier wrote: >>> - remove PATCH 3, and add in task_struct a "ktime vtime" where we accumulate guest time (by calling something like guest_enter() and guest_exit() from the virtualization engine), and when in account_system_time() we have cputime > vtime we subtract vtime from cputime and add vtime to user time and guest time. But doing like this we freeze in kernel/sched.c the link between system time, user time and guest time (i.e. system time = system time - vtime, user time = user time + vtime and guest time = guest time + vtime). >>> Actually, I think we can set a per-cpu "in_guest" flag for the scheduler >>> code, which then knows to add the tick to the guest time. That seems >>> the simplest possible solution. >>> >>> lguest or kvm would set the flag before running the guest (which is done >>> with preempt disabled or using preemption hooks), and reset it >>> afterwards. >>> >>> Thoughts? >>> >> It was my first attempt (except I didn't have a per-cpu flag, but a per-task >> flag), it's not visible but I love simplicity... ;-) >> >> A KVM VCPU is stopped by preemption, so when we enter the scheduler we have >> exited from the VCPU and thus this flag is off (so we account 0 to the guest). >> What >> I did then is "set the flag on when we enter the VCPU, and >> "account_system_time()" sets the flag off when it adds this timeslice to >> cpustat >> (and compute correctly guest, user, system time). But I didn't like this idea >> because all code executed after we entered the VCPU is accounted to the >> guest >> until we have an account_system_time() and I suppose we can have real system >> time in this part. And I guess a VCPU can be less than 1 ms (unit of >> cputime) in >> a timeslice. >> >> So? What's best? >> > > The normal user/system accounting has the same issue, no? Wherever we > happen to land (kernel or user) gets the whole tick. Yes... 
but perhaps I should rewrite this too ;-) > So I think it is okay to have the same limitation for guest time. OK, so we can go back to my first patch. Who can decide to introduce this into the kernel? Laurent -- - [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
[PATCH -mm] x86_64: Save registers in saved_context during suspend and hibernation
From: Rafael J. Wysocki <[EMAIL PROTECTED]> During hibernation and suspend on x86_64 save CPU registers in the saved_context structure rather than in a handful of separate variables. Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]> Looks-ok-to: Pavel Machek <[EMAIL PROTECTED]> --- arch/x86_64/kernel/acpi/wakeup.S | 101 --- arch/x86_64/kernel/asm-offsets.c | 28 ++ arch/x86_64/kernel/suspend.c |6 -- arch/x86_64/kernel/suspend_asm.S | 72 ++- include/asm-x86_64/suspend.h | 23 ++-- 5 files changed, 125 insertions(+), 105 deletions(-) Index: linux-2.6.23-rc3/arch/x86_64/kernel/asm-offsets.c === --- linux-2.6.23-rc3.orig/arch/x86_64/kernel/asm-offsets.c +++ linux-2.6.23-rc3/arch/x86_64/kernel/asm-offsets.c @@ -76,6 +76,34 @@ int main(void) DEFINE(pbe_orig_address, offsetof(struct pbe, orig_address)); DEFINE(pbe_next, offsetof(struct pbe, next)); BLANK(); +#define ENTRY(entry) DEFINE(pt_regs_ ## entry, offsetof(struct pt_regs, entry)) + ENTRY(rbx); + ENTRY(rbx); + ENTRY(rcx); + ENTRY(rdx); + ENTRY(rsp); + ENTRY(rbp); + ENTRY(rsi); + ENTRY(rdi); + ENTRY(r8); + ENTRY(r9); + ENTRY(r10); + ENTRY(r11); + ENTRY(r12); + ENTRY(r13); + ENTRY(r14); + ENTRY(r15); + ENTRY(eflags); + BLANK(); +#undef ENTRY +#define ENTRY(entry) DEFINE(saved_context_ ## entry, offsetof(struct saved_context, entry)) + ENTRY(cr0); + ENTRY(cr2); + ENTRY(cr3); + ENTRY(cr4); + ENTRY(cr8); + BLANK(); +#undef ENTRY DEFINE(TSS_ist, offsetof(struct tss_struct, ist)); BLANK(); DEFINE(crypto_tfm_ctx_offset, offsetof(struct crypto_tfm, __crt_ctx)); Index: linux-2.6.23-rc3/include/asm-x86_64/suspend.h === --- linux-2.6.23-rc3.orig/include/asm-x86_64/suspend.h +++ linux-2.6.23-rc3/include/asm-x86_64/suspend.h @@ -3,6 +3,9 @@ * Based on code * Copyright 2001 Patrick Mochel <[EMAIL PROTECTED]> */ +#ifndef __ASM_X86_64_SUSPEND_H +#define __ASM_X86_64_SUSPEND_H + #include #include @@ -12,8 +15,9 @@ arch_prepare_suspend(void) return 0; } -/* Image of the saved processor state. If you touch this, fix acpi_wakeup.S. 
*/ +/* Image of the saved processor state. If you touch this, fix acpi/wakeup.S. */ struct saved_context { + struct pt_regs regs; u16 ds, es, fs, gs, ss; unsigned long gs_base, gs_kernel_base, fs_base; unsigned long cr0, cr2, cr3, cr4, cr8; @@ -29,27 +33,14 @@ struct saved_context { unsigned long tr; unsigned long safety; unsigned long return_address; - unsigned long eflags; } __attribute__((packed)); -/* We'll access these from assembly, so we'd better have them outside struct */ -extern unsigned long saved_context_eax, saved_context_ebx, saved_context_ecx, saved_context_edx; -extern unsigned long saved_context_esp, saved_context_ebp, saved_context_esi, saved_context_edi; -extern unsigned long saved_context_r08, saved_context_r09, saved_context_r10, saved_context_r11; -extern unsigned long saved_context_r12, saved_context_r13, saved_context_r14, saved_context_r15; -extern unsigned long saved_context_eflags; - #define loaddebug(thread,register) \ set_debugreg((thread)->debugreg##register, register) extern void fix_processor_context(void); -extern unsigned long saved_rip; -extern unsigned long saved_rsp; -extern unsigned long saved_rbp; -extern unsigned long saved_rbx; -extern unsigned long saved_rsi; -extern unsigned long saved_rdi; - /* routines for saving/restoring kernel state */ extern int acpi_save_state_mem(void); + +#endif /* __ASM_X86_64_SUSPEND_H */ Index: linux-2.6.23-rc3/arch/x86_64/kernel/suspend.c === --- linux-2.6.23-rc3.orig/arch/x86_64/kernel/suspend.c +++ linux-2.6.23-rc3/arch/x86_64/kernel/suspend.c @@ -19,12 +19,6 @@ extern const void __nosave_begin, __nosa struct saved_context saved_context; -unsigned long saved_context_eax, saved_context_ebx, saved_context_ecx, saved_context_edx; -unsigned long saved_context_esp, saved_context_ebp, saved_context_esi, saved_context_edi; -unsigned long saved_context_r08, saved_context_r09, saved_context_r10, saved_context_r11; -unsigned long saved_context_r12, saved_context_r13, saved_context_r14, 
saved_context_r15; -unsigned long saved_context_eflags; - void __save_processor_state(struct saved_context *ctxt) { kernel_fpu_begin(); Index: linux-2.6.23-rc3/arch/x86_64/kernel/suspend_asm.S === --- linux-2.6.23-rc3.orig/arch/x86_64/kernel/suspend_asm.S +++ linux-2.6.23-rc3/arch/x86_64/kernel/suspend_asm.S @@ -17,24 +17,24 @@ #include
[PATCH -mm] PNP: Make pnpacpi_suspend handle errors
From: Rafael J. Wysocki <[EMAIL PROTECTED]> pnpacpi_suspend() doesn't check the result returned by acpi_pm_device_sleep_state() before passing it to acpi_bus_set_power(), which may not be desirable. Make it select the target power state of the device using its second argument if acpi_pm_device_sleep_state() fails. Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]> Looks-ok-to: Pavel Machek <[EMAIL PROTECTED]> --- drivers/pnp/pnpacpi/core.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) Index: linux-2.6.23-rc3/drivers/pnp/pnpacpi/core.c === --- linux-2.6.23-rc3.orig/drivers/pnp/pnpacpi/core.c2007-08-15 14:56:58.0 +0200 +++ linux-2.6.23-rc3/drivers/pnp/pnpacpi/core.c 2007-08-15 15:08:36.0 +0200 @@ -130,11 +130,16 @@ static int pnpacpi_disable_resources(str #ifdef CONFIG_ACPI_SLEEP static int pnpacpi_suspend(struct pnp_dev *dev, pm_message_t state) { - return acpi_bus_set_power((acpi_handle) dev->data, - acpi_pm_device_sleep_state(&dev->dev, -device_may_wakeup -(&dev->dev), -NULL)); + int power_state; + + power_state = acpi_pm_device_sleep_state(&dev->dev, + device_may_wakeup(&dev->dev), + NULL); + if (power_state < 0) + power_state = (state.event == PM_EVENT_ON) ? + ACPI_STATE_D0 : ACPI_STATE_D3; + + return acpi_bus_set_power((acpi_handle) dev->data, power_state); } static int pnpacpi_resume(struct pnp_dev *dev) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH/RFC 4/4, second shot]KVM uses "account_guest_time()"
Avi Kivity wrote: > Laurent Vivier wrote: >> KVM updates vtime in task_struct to allow account_guest_time() to modify >> user, >> system and guest time in cpustat accordingly. >> > >> --- kvm.orig/drivers/kvm/Kconfig 2007-08-17 10:24:46.0 +0200 >> +++ kvm/drivers/kvm/Kconfig 2007-08-17 10:25:25.0 +0200 >> @@ -41,4 +41,10 @@ >>Provides support for KVM on AMD processors equipped with the AMD-V >>(SVM) extensions. >> >> +config GUEST_ACCOUNTING >> +bool "Virtual Machine accounting support" >> +depends on KVM >> +---help--- >> + Allows to account CPU time used by the Virtual Machines. >> + > > > Other way round. In the patch that adds account_guest_time(), have a > CONFIG_GUEST_ACCOUNTING (defaulting to n, with no description, help, or > dependencies). Then, CONFIG_KVM can select GUEST_ACCOUNTING. I was wondering in which Kconfig I can put it... > The advantages of this are: > - the puppyvisor can also select this if it so wishes > - we don't have core code reference some obscure module I agree. > CONFIG_PREEMPT_NOTIFIERS does the same thing. I saw Laurent -- - [EMAIL PROTECTED] -- "Software is hard" - Donald Knuth
Re: [patch] encapsulate uevent()/add_uevent_var() buffer handling
On Tue, 14 Aug 2007 15:15:12 +0200, Kay Sievers <[EMAIL PROTECTED]> wrote: > From: Kay Sievers <[EMAIL PROTECTED]> > Subject: Driver core: change add_uevent_var to use a struct This still needs some (trivial) s390 fixes: Signed-off-by: Cornelia Huck <[EMAIL PROTECTED]> --- drivers/s390/cio/device.c|2 -- drivers/s390/crypto/ap_bus.c |2 +- 2 files changed, 1 insertion(+), 3 deletions(-) --- linux-2.6.orig/drivers/s390/crypto/ap_bus.c +++ linux-2.6/drivers/s390/crypto/ap_bus.c @@ -458,7 +458,7 @@ static int ap_bus_match(struct device *d * uevent function for AP devices. It sets up a single environment * variable DEV_TYPE which contains the hardware device type. */ -static int ap_uevent (struct device *dev, struct kobj_uevent_env) +static int ap_uevent(struct device *dev, struct kobj_uevent_env *env) { struct ap_device *ap_dev = to_ap_dev(dev); int retval = 0; --- linux-2.6.orig/drivers/s390/cio/device.c +++ linux-2.6/drivers/s390/cio/device.c @@ -82,8 +82,6 @@ static int ccw_uevent(struct device *dev { struct ccw_device *cdev = to_ccwdev(dev); struct ccw_device_id *id = &(cdev->id); - int i = 0; - int len = 0; int ret; char modalias_buf[30]; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH/RFC 3/4, second shot]Introduce "account_guest_time"
Laurent Vivier wrote: > This is another way to compute guest time... I remove the "account modifiers" > mechanism and call directly account_guest_time() from account_system_time(). > account_system_time() computes user, system and guest times according to the value > accumulated in vtime (a ktime_t) in task_struct by the virtual machine. > > @@ -3246,6 +3277,10 @@ > struct rq *rq = this_rq(); > cputime64_t tmp; > > +#ifdef CONFIG_GUEST_ACCOUNTING > + cputime = account_guest_time(p, cputime); > +#endif > + > p->stime = cputime_add(p->stime, cputime); > > /* Add system time to cpustat. */ In order to reduce the impact on whatever function this is in (use diff -p please), you can always have a definition of account_guest_time: #else static cputime_t account_guest_time(struct task_struct *p, cputime_t cputime) { return cputime; } #endif This way the #ifdef/#endif is not necessary when calling it. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, 17 Aug 2007, Nick Piggin wrote: > Satyam Sharma wrote: > > > On Fri, 17 Aug 2007, Nick Piggin wrote: > > > > > Because they should be thinking about them in terms of barriers, over > > > which the compiler / CPU is not to reorder accesses or cache memory > > > operations, rather than "special" "volatile" accesses. > > > > This is obviously just a taste thing. Whether to have that forget(x) > > barrier as something author should explicitly sprinkle appropriately > > in appropriate places in the code by himself or use a primitive that > > includes it itself. > > That's not obviously just taste to me. Not when the primitive has many > (perhaps, the majority) of uses that do not require said barriers. And > this is not solely about the code generation (which, as Paul says, is > relatively minor even on x86). See, you do *require* people to have to repeat the same things to you! As has been written about enough times already, and if you followed the discussion on this thread, I am *not* proposing that atomic_read()'s semantics be changed to have any extra barriers. What is proposed is a different atomic_read_xxx() variant thereof, that those can use who do want that. Now whether to have a kind of barrier ("volatile", whatever) in the atomic_read_xxx() itself, or whether to make the code writer himself to explicitly write the order(x) appropriately in appropriate places in the code _is_ a matter of taste. > > That's definitely the point, why not. This is why "barrier()", being > > heavy-handed, is not the best option. > > That is _not_ the point [...] Again, you're requiring me to repeat things that were already made evident on this thread (if you follow it). This _is_ the point, because a lot of loops out there (too many of them, I WILL NOT bother citing file_name:line_number) end up having to use a barrier just because they're using a loop-exit-condition that depends on a value returned by atomic_read(). 
It would be good for them if they used an atomic_read_xxx() primitive that gave these "volatility" semantics without junking compiler optimizations for other memory references. > because there has already been an alternative posted Whether that alternative (explicitly using forget(x), or wrappers thereof, such as the "order_atomic" you proposed) is better than other alternatives (such as atomic_read_xxx() which includes the volatility behaviour in itself) is still open, and precisely what we started discussing just one mail back. (The above was also mostly stuff I had to repeat for you, sadly.) > that better conforms with Linux barrier > API and is much more widely useful and more usable. I don't think so. (Now *this* _is_ the "taste-dependent matter" that I mentioned earlier.) > If you are so worried > about > barrier() being too heavyweight, then you're off to a poor start by wanting to > add a few K of kernel text by making atomic_read volatile. Repeating myself, for the N'th time, NO, I DON'T want to make atomic_read have "volatile" semantics. > > > because as I also mentioned, the logical extension > > > to Linux's barrier API to handle this is the order(x) macro. Again, not > > > special volatile accessors. > > > > Sure, that forget(x) macro _is_ proposed to be made part of the generic > > API. Doesn't explain why not to define/use primitives that have volatility > > semantics in themselves, though (taste matters apart). > If you follow the discussion You were thinking of a reason why the > semantics *should* be changed or added, and I was rebutting your argument > that it must be used when a full barrier() is too heavy (ie. by pointing > out that order() has superior semantics anyway). Amazing. Either you have reading comprehension problems, or else, please try reading this thread (or at least this sub-thread) again. I don't want _you_ blaming _me_ for having to repeat things to you all over again. 
Re: [PATCH] hostfs: Remove pointless if statement
On Fri, Aug 17, 2007 at 12:24:23PM +0530, Satyam Sharma wrote: > We normally use "comments" for that, not dead code that a compiler > then elides ;-) I'd argue that comments are for when you can't make the code self-explanatory. > [PATCH] hostfs: Remove pointless if statement > > And replace with comment saying we don't handle ctime. > Also some coding-style fixes while I was there. I'll forward this upstream. Jeff -- Work email - jdike at linux dot intel dot com
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
On Friday, 17 August 2007, Laurent Vivier wrote: > > The normal user/system accounting has the same issue, no? Wherever we > > happen to land (kernel or user) gets the whole tick. > > Yes... but perhaps I should rewrite this too ;-) If you look further, you will see that this was actually rewritten in 2.6.12 and that's why we have cputime_t. The infrastructure is currently only used by s390 and ppc64. On s390 we use our cpu timer to get the current time on each syscall/irq/context switch etc to get precise accounting data for system/user/irq/softirq/nice. We also get steal time (this is interesting for the guest: how much of my cpu was actually not available because the hypervisor scheduled me away). I don't know enough about other architectures to say which one could exploit this infrastructure as well. The current git tree is somewhat broken for CONFIG_VIRT_CPU_ACCOUNTING due to the accounting changes introduced by CFS - we work on this. If you are interested in the cputime stuff, you can have a look at arch/s390/kernel/vtime.c (account_system_vtime, account_vtime) as well as: http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=6be7071fdd321c206b1ee7a3e534782f25572830 for the first introduction and http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=db59f519d37f80683f3611927f6b5c3bdfc0f9e6 for the s390 exploitation. Christian
Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism
Avi Kivity wrote: [...] > > The normal user/system accounting has the same issue, no? Whereever we > happen to land (kernel or user) gets the whole tick. > > So I think it is okay to have the same limitation for guest time. > So this is how it looks like. PATCH 1 and 2 are always a prerequisite. Laurent Index: kvm/include/linux/sched.h === --- kvm.orig/include/linux/sched.h 2007-08-17 15:07:02.0 +0200 +++ kvm/include/linux/sched.h 2007-08-17 15:08:19.0 +0200 @@ -1310,6 +1310,7 @@ #define PF_STARTING0x0002 /* being created */ #define PF_EXITING 0x0004 /* getting shut down */ #define PF_EXITPIDONE 0x0008 /* pi exit done on shut down */ +#define PF_VCPU0x0010 /* I'm a virtual CPU */ #define PF_FORKNOEXEC 0x0040 /* forked but didn't exec */ #define PF_SUPERPRIV 0x0100 /* used super-user privileges */ #define PF_DUMPCORE0x0200 /* dumped core */ Index: kvm/kernel/sched.c === --- kvm.orig/kernel/sched.c 2007-08-17 14:42:43.0 +0200 +++ kvm/kernel/sched.c 2007-08-17 15:16:20.0 +0200 @@ -3246,10 +3246,22 @@ struct rq *rq = this_rq(); cputime64_t tmp; + tmp = cputime_to_cputime64(cputime); + if (p->flags & PF_VCPU) { + p->utime = cputime_add(p->utime, cputime); + p->gtime = cputime_add(p->gtime, cputime); + + cpustat->guest = cputime64_add(cpustat->guest, tmp); + cpustat->user = cputime64_add(cpustat->user, tmp); + + p->flags &= ~PF_VCPU; + + return; + } + p->stime = cputime_add(p->stime, cputime); /* Add system time to cpustat. 
*/ - tmp = cputime_to_cputime64(cputime); if (hardirq_count() - hardirq_offset) cpustat->irq = cputime64_add(cpustat->irq, tmp); else if (softirq_count()) Index: kvm/drivers/kvm/kvm.h === --- kvm.orig/drivers/kvm/kvm.h 2007-08-17 15:26:16.0 +0200 +++ kvm/drivers/kvm/kvm.h 2007-08-17 15:29:46.0 +0200 @@ -589,6 +589,19 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +#ifndef PF_VCPU +#define PF_VCPU0 /* no kernel support */ +#endif + +static inline void kvm_guest_enter(void) +{ + current->flags |= PF_VCPU; +} + +static inline void kvm_guest_exit(void) +{ +} + static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) { Index: kvm/drivers/kvm/svm.c === --- kvm.orig/drivers/kvm/svm.c 2007-08-17 15:26:16.0 +0200 +++ kvm/drivers/kvm/svm.c 2007-08-17 15:27:03.0 +0200 @@ -1404,6 +1404,7 @@ clgi(); vcpu->guest_mode = 1; + kvm_guest_enter(); if (vcpu->requests) if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests)) svm_flush_tlb(vcpu); @@ -1536,6 +1537,7 @@ #endif : "cc", "memory" ); + kvm_guest_exit(); vcpu->guest_mode = 0; if (vcpu->fpu_active) { Index: kvm/drivers/kvm/vmx.c === --- kvm.orig/drivers/kvm/vmx.c 2007-08-17 15:26:16.0 +0200 +++ kvm/drivers/kvm/vmx.c 2007-08-17 15:27:45.0 +0200 @@ -2078,6 +2078,7 @@ local_irq_disable(); vcpu->guest_mode = 1; + kvm_guest_enter(); if (vcpu->requests) if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests)) vmx_flush_tlb(vcpu); @@ -2198,6 +2199,7 @@ [cr2]"i"(offsetof(struct kvm_vcpu, cr2)) : "cc", "memory" ); + kvm_guest_exit(); vcpu->guest_mode = 0; local_irq_enable();
[PATCH] Isolate some explicit usage of task->tgid
With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that this is the same and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> Acked-by: Oleg Nesterov <[EMAIL PROTECTED]> --- fs/exec.c |4 ++-- fs/proc/base.c|4 ++-- include/linux/sched.h |6 ++ kernel/posix-cpu-timers.c | 12 ++-- kernel/posix-timers.c |4 ++-- kernel/ptrace.c |2 +- kernel/signal.c |2 +- mm/oom_kill.c |2 +- 8 files changed, 21 insertions(+), 15 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index f8cebf8..24b669d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1691,6 +1691,12 @@ static inline int has_group_leader_pid(s return p->pid == p->tgid; } +static inline +int same_thread_group(struct task_struct *p1, struct task_struct *p2) +{ + return p1->tgid == p2->tgid; +} + static inline struct task_struct *next_thread(const struct task_struct *p) { return list_entry(rcu_dereference(p->thread_group.next), diff --git a/fs/exec.c b/fs/exec.c index 35013f2..415cefb 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -865,8 +865,8 @@ static int de_thread(struct task_struct write_lock_irq(&tasklist_lock); - BUG_ON(leader->tgid != tsk->tgid); - BUG_ON(tsk->pid == tsk->tgid); + BUG_ON(!same_thread_group(leader, tsk)); + BUG_ON(thread_group_leader(tsk)); /* * An exec() starts a new thread group with the * TGID of the previous thread group. Rehash the diff --git a/fs/proc/base.c b/fs/proc/base.c index e3009ab..31e7dfe 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2288,7 +2288,7 @@ retry: * found doesn't happen to be a thread group leader. * As we don't care in the case of readdir. 
*/ - if (!task || !has_group_leader_pid(task)) + if (!task || !thread_group_leader(task)) goto retry; get_task_struct(task); } @@ -2474,7 +2474,7 @@ static struct dentry *proc_task_lookup(s rcu_read_unlock(); if (!task) goto out; - if (leader->tgid != task->tgid) + if (!same_thread_group(leader, task)) goto out_drop_task; result = proc_task_instantiate(dir, dentry, task, NULL); diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c index b53c8fc..68c9637 100644 --- a/kernel/posix-cpu-timers.c +++ b/kernel/posix-cpu-timers.c @@ -21,8 +21,8 @@ static int check_clock(const clockid_t w read_lock(&tasklist_lock); p = find_task_by_pid(pid); - if (!p || (CPUCLOCK_PERTHREAD(which_clock) ? - p->tgid != current->tgid : p->tgid != pid)) { + if (!p || !(CPUCLOCK_PERTHREAD(which_clock) ? + same_thread_group(p, current) : thread_group_leader(p))) { error = -EINVAL; } read_unlock(&tasklist_lock); @@ -308,13 +308,13 @@ int posix_cpu_clock_get(const clockid_t p = find_task_by_pid(pid); if (p) { if (CPUCLOCK_PERTHREAD(which_clock)) { - if (p->tgid == current->tgid) { + if (same_thread_group(p, current)) { error = cpu_clock_sample(which_clock, p, &rtn); } } else { read_lock(&tasklist_lock); - if (p->tgid == pid && p->signal) { + if (thread_group_leader(p) && p->signal) { error = cpu_clock_sample_group(which_clock, p, &rtn); @@ -355,7 +355,7 @@ int posix_cpu_timer_create(struct k_itim p = current; } else { p = find_task_by_pid(pid); - if (p && p->tgid != current->tgid) + if (p && !same_thread_group(p, current)) p = NULL; } } else { @@ -363,7 +363,7 @@ int posix_cpu_timer_create(struct k_itim p = current->group_leader; } else { p = find_task_by_pid(pid); - if
Re: System call interposition/unprotecting the table
On Wed, Aug 15, 2007 at 12:48:35AM +0200, Andi Kleen wrote: > > > In general the .data protection is only considered a debugging > > > feature. I don't know why Fedora enables it in their production > > > kernels. > > > > That would be because we think you are wrong 8) > > Well, it might at best buy you a few weeks/months in > terms of the exploit arms race, but thrash your users' TLBs > forever. Show me a single situation where this matters. When we first enabled it, we tried both benchmarks and real-world loads, and it didn't matter at all. Unless something fundamental has changed since then, the story should still be the same. Dave -- http://www.codemonkey.org.uk
[PATCH] [SCSI] arcmsr: Fix error handling
Fixed error handling in queuecommand(), now all READ_ and WRITE_ commands are aborted in case of RAID is gone. Before only READ_6 and WRITE_6 commands were aborted. Signed-off-by: Maik Hampel <[EMAIL PROTECTED]> --- diff --git a/drivers/scsi/arcmsr/arcmsr_hba.c b/drivers/scsi/arcmsr/arcmsr_hba.c index 0ddfc21..7a001ad 100644 --- a/drivers/scsi/arcmsr/arcmsr_hba.c +++ b/drivers/scsi/arcmsr/arcmsr_hba.c @@ -1133,17 +1133,27 @@ static int arcmsr_queue_command(struct scsi_cmnd *cmd, uint8_t block_cmd; block_cmd = cmd->cmnd[0] & 0x0f; - if (block_cmd == 0x08 || block_cmd == 0x0a) { - printk(KERN_NOTICE - "arcmsr%d: block 'read/write'" - "command with gone raid volume" - " Cmd = %2x, TargetId = %d, Lun = %d \n" - , acb->host->host_no - , cmd->cmnd[0] - , target, lun); - cmd->result = (DID_NO_CONNECT << 16); - cmd->scsi_done(cmd); - return 0; + switch (block_cmd) { + case READ_6: + case READ_10: + case READ_12: + case READ_16: + case WRITE_6: + case WRITE_10: + case WRITE_12: + case WRITE_16: + printk(KERN_NOTICE + "arcmsr%d: block 'read/write'" + "command with gone raid volume" + " Cmd = %2x, TargetId = %d, Lun = %d \n" + , acb->host->host_no + , cmd->cmnd[0] + , target, lun); + cmd->result = (DID_NO_CONNECT << 16); + cmd->scsi_done(cmd); + return 0; + default: + break; } } if (atomic_read(&acb->ccboutstandingcount) >= -- 1.4.4.4 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
On Fri, Aug 17, 2007 at 01:09:08PM +0530, Satyam Sharma wrote: > > > On Thu, 16 Aug 2007, Paul E. McKenney wrote: > > > On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote: > > > On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote: > > > > > > > > The compiler can also reorder non-volatile accesses. For an example > > > > patch that cares about this, please see: > > > > > > > > http://lkml.org/lkml/2007/8/7/280 > > > > > > > > This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and > > > > rcu_read_unlock() to ensure that accesses aren't reordered with respect > > > > to interrupt handlers and NMIs/SMIs running on that same CPU. > > > > > > Good, finally we have some code to discuss (even though it's > > > not actually in the kernel yet). > > > > There was some earlier in this thread as well. > > Hmm, I never quite got what all this interrupt/NMI/SMI handling and > RCU business you mentioned earlier was all about, but now that you've > pointed to the actual code and issues with it ... Glad to help... > > > First of all, I think this illustrates that what you want > > > here has nothing to do with atomic ops. The ORDERED_WRT_IRQ > > > macro occurs a lot more times in your patch than atomic > > > reads/sets. So *assuming* that it was necessary at all, > > > then having an ordered variant of the atomic_read/atomic_set > > > ops could do just as well. > > > > Indeed. If I could trust atomic_read()/atomic_set() to cause the compiler > > to maintain ordering, then I could just use them instead of having to > > create an ORDERED_WRT_IRQ(). (Or ACCESS_ONCE(), as it is called in a > > different patch.) > > +#define WHATEVER(x) (*(volatile typeof(x) *)&(x)) > > I suppose one could want volatile access semantics for stuff that's > a bit-field too, no? One could, but this is not supported in general. So if you want that, you need to use the usual bit-mask tricks and (for setting) atomic operations. 
> Also, this gives *zero* "re-ordering" guarantees that your code wants > as you've explained it below) -- neither w.r.t. CPU re-ordering (which > probably you don't care about) *nor* w.r.t. compiler re-ordering > (which you definitely _do_ care about). You are correct about CPU re-ordering (and about the fact that this example doesn't care about it), but not about compiler re-ordering. The compiler is prohibited from moving a volatile access across a sequence point. One example of a sequence point is a statement boundary. Because all of the volatile accesses in this code are separated by statement boundaries, a conforming compiler is prohibited from reordering them. > > > However, I still don't know which atomic_read/atomic_set in > > > your patch would be broken if there were no volatile. Could > > > you please point them out? > > > > Suppose I tried replacing the ORDERED_WRT_IRQ() calls with > > atomic_read() and atomic_set(). Starting with __rcu_read_lock(): > > > > o If "ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++" > > was ordered by the compiler after > > "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1", then > > suppose an NMI/SMI happened after the rcu_read_lock_nesting but > > before the rcu_flipctr. > > > > Then if there was an rcu_read_lock() in the SMI/NMI > > handler (which is perfectly legal), the nested rcu_read_lock() > > would believe that it could take the then-clause of the > > enclosing "if" statement. But because the rcu_flipctr per-CPU > > variable had not yet been incremented, an RCU updater would > > be within its rights to assume that there were no RCU reads > > in progress, thus possibly yanking a data structure out from > > under the reader in the SMI/NMI function. > > > > Fatal outcome. Note that only one CPU is involved here > > because these are all either per-CPU or per-task variables. > > Ok, so you don't care about CPU re-ordering. 
Still, I should let you know > that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you > want is a full compiler optimization barrier(). No. See above. > [ Your code probably works now, and emits correct code, but that's > just because of gcc did what it did. Nothing in any standard, > or in any documented behaviour of gcc, or anything about the real > (or expected) semantics of "volatile" is protecting the code here. ] Really? Why doesn't the prohibition against moving volatile accesses across sequence points take care of this? > > o If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1" > > was ordered by the compiler to follow the > > "ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx", and an NMI/SMI > > happened between the two, then an __rcu_read_lock() in the NMI/SMI > > would incorrectly take the "else" clause of the enclosing "if" > > statement. If some other CPU flipped the rcu_ctrlblk.completed > > in the meantime, then the __rcu_read_lock() would (correctl