Re: Bug #7674 (shutdown hd noise) EDIT: wrong address, sorry!
My apologies to the lkml users for the unwanted noise. I made a mistake
with my mail client.

Francesco

2007/3/2, Francesco Pretto <[EMAIL PROTECTED]>:
> I'll send you a message from the thread. You only have to answer it
> (with the reply-to function of your browser), changing the TO: address
> to linux-kernel@vger.kernel.org (you don't have to be subscribed; I'm
> not, for example). Hopefully it will maintain the headers and merge
> with the rest of the thread.
>
> Bye
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] Fixes and cleanups for earlyprintk aka boot console.
On Tue, 20 Feb 2007 12:35:49 +0100 Gerd Hoffmann <[EMAIL PROTECTED]> wrote:

> The console subsystem already has an idea of a boot console, using the
> CON_BOOT flag. The implementation has some flaws though. The major
> problem is that presence of a boot console makes register_console()
> ignore any other console devices (unless explicitly specified on the
> kernel command line).
>
> This patch fixes the console selection code to *not* consider a boot
> console a full-featured one, so the first non-boot console registering
> will become the default console instead. This way the unregister call
> for the boot console in the register_console() function actually
> triggers and the handover from the boot console to the real console
> device works smoothly. Added a printk for the handover, so you know
> which console device the output goes to when the boot console stops
> printing messages.
>
> The disable_early_printk() call is obsolete with that patch, explicitly
> disabling the early console isn't needed any more as it works
> automagically with that patch.
>
> I've walked through the tree, dropped all disable_early_printk()
> instances found below arch/ and tagged the consoles with CON_BOOT if
> needed.
>
> The code is tested on x86 only so far. It is probably a good idea to
> run it in -mm for a while to shake out any architecture issues which
> might show up.
>
> Comments?

It blows up on powerpc:

drivers/built-in.o(.init.text+0x2080): In function `.console_init':
: undefined reference to `.disable_early_printk'

and the below patch might help. But my confidence level isn't high so I'll
drop it for now. I have a feeling this will need careful testing.

--- a/arch/x86_64/kernel/early_printk.c~fixes-and-cleanups-for-earlyprintk-aka-boot-console-fix
+++ a/arch/x86_64/kernel/early_printk.c
@@ -249,17 +249,3 @@ static int __init setup_early_printk(cha
 }
 
 early_param("earlyprintk", setup_early_printk);
-
-void __init disable_early_printk(void)
-{
-	if (!early_console_initialized || !early_console)
-		return;
-	if (!keep_early) {
-		printk("disabling early console\n");
-		unregister_console(early_console);
-		early_console_initialized = 0;
-	} else {
-		printk("keeping early console\n");
-	}
-}
-
diff -puN drivers/char/tty_io.c~fixes-and-cleanups-for-earlyprintk-aka-boot-console-fix drivers/char/tty_io.c
--- a/drivers/char/tty_io.c~fixes-and-cleanups-for-earlyprintk-aka-boot-console-fix
+++ a/drivers/char/tty_io.c
@@ -141,8 +141,6 @@ static DECLARE_MUTEX(allocated_ptys_lock
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
-extern void disable_early_printk(void);
-
 static void initialize_tty_struct(struct tty_struct *tty);
 
 static ssize_t tty_read(struct file *, char __user *, size_t, loff_t *);
@@ -3889,13 +3887,6 @@ void __init console_init(void)
 	/* Setup the default TTY line discipline. */
 	(void) tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY);
 
-	/*
-	 * set up the console device so that later boot sequences can
-	 * inform about problems etc..
-	 */
-#ifdef CONFIG_EARLY_PRINTK
-	disable_early_printk();
-#endif
 	call = __con_initcall_start;
 	while (call < __con_initcall_end) {
 		(*call)();
_
Re: Bug #7674 (shutdown hd noise)
2007/3/2, Dan Gilliam <[EMAIL PROTECTED]>:
> Hi Francesco,
>
> I just tried to submit a plea to that address, but it's not letting me
> post to it (refused). Help!
>
> Dan

I'll send you a message from the thread. You only have to answer it
(with the reply-to function of your browser), changing the TO: address
to linux-kernel@vger.kernel.org (you don't have to be subscribed; I'm
not, for example). Hopefully it will maintain the headers and merge
with the rest of the thread.

Bye
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Sure we will. And you believe that the newer controllers will be able
> > to magically shrink the SG lists somehow? We will offload the
> > coalescing of the page structs into bios in hardware or some such thing?
> > And the vmscans etc too?
>
> As far as pagecache page management goes, is that an issue for you?
> I don't want to know about how many billions of pages for some operation,
> just some profiles.

If there are billions of pages in the system and we are allocating and
deallocating then pages need to be aged. If there are just a few pages
freeable then we run into issues.

> > > I understand you have controllers (or maybe it is a block layer limit)
> > > that doesn't work well with 4K pages, but works OK with 16K pages.
> > Really? This is the first that I have heard about it.
> Maybe that's the issue you're running into.

Oh, I am running into an issue on a system that does not yet exist? I am
extrapolating from the problems that we commonly see now. Those will get
worse the more memory increases.

> > > This is not something that we would introduce variable sized pagecache
> > > for, surely.
> > I am not sure where you get the idea that this is the sole reason why we
> > need to be able to handle larger contiguous chunks of memory.
> I'm not saying that. You brought up this subject of variable sized pagecache.

You keep bringing up the 4k/16k issue into this for some reason. I want
just the ability to handle large amounts of memory. Larger page sizes are
a way to accomplish that.

> Eventually, increasing x86 page size a bit might be an idea. We could even
> do it in software if CPU manufacturers don't for us.

A bit? Are we back to the 4k/16k issue? We need to reach 2M at minimum.
Some way to handle continuous memory segments of 1GB and larger
effectively would be great.

> That doesn't buy us a great deal if you think there is this huge looming
> problem with struct page management though.

I am not the first one. See Rik's posts regarding the reasons for his new
page replacement algorithms.
Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread
> From: Paul E. McKenney <[EMAIL PROTECTED]>
>
> Remove PF_NOFREEZE from the rcutorture thread, adding a try_to_freeze()
> call as required.
>
> Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]>
> Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
> Acked-by: Pavel Machek <[EMAIL PROTECTED]>
> ---
>  kernel/rcutorture.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.20-mm2/kernel/rcutorture.c
> ===
> --- linux-2.6.20-mm2.orig/kernel/rcutorture.c 2007-02-25 12:07:15.0 +0100
> +++ linux-2.6.20-mm2/kernel/rcutorture.c 2007-02-25 12:49:23.0 +0100
> @@ -46,6 +46,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR("Paul E. McKenney <[EMAIL PROTECTED]> and "
> @@ -585,7 +586,6 @@ rcu_torture_writer(void *arg)
>
>  	VERBOSE_PRINTK_STRING("rcu_torture_writer task started");
>  	set_user_nice(current, 19);
> -	current->flags |= PF_NOFREEZE;
>
>  	do {
>  		schedule_timeout_uninterruptible(1);
> @@ -607,6 +607,7 @@ rcu_torture_writer(void *arg)
>  		}
>  		rcu_torture_current_version++;
>  		oldbatch = cur_ops->completed();
> +		try_to_freeze();
>  	} while (!kthread_should_stop() && !fullstop);
>  	VERBOSE_PRINTK_STRING("rcu_torture_writer task stopping");
>  	while (!kthread_should_stop())

Paul,

Any reasons for not try_to_freeze()'ing the fakewriter and the reader
threads? (Ok, I admit, I haven't looked into the code for the reason,
which might be obvious.)

thanks
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
Re: 2.6.21-rc1: known regressions (part 2)
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> But most likely, 9f4bd5dd is actually already bad, and what you are
> seeing is two *different* bugs that just have the same symptoms
> ("suspend doesn't work").

the situation is simpler than that: there is a /known/ bug, and i marked
the bugfix commit as 'good'. I never met such a multiple-bugs scenario
before and forgot that git-bisect could easily pick a tree without this
essential bugfix and would not be able to make a distinction between the
two types of badness.

I'll try what i've described in the previous mail: mark all bisection
points that do not include f3ccb06f as 'good' - thus 'merging' the
known-bad area with the first known-good commit, and thus eliminating it
from the bisection space.

(but it might also be useful to have a "git-bisect must-include" kind of
command that would allow this to be automated: mark a particular tree as
an essential component of the search space.)

	Ingo
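
A toy model of the workaround described above may make the failure mode
concrete. It is only a sketch under loud assumptions: the real search
space is a commit graph, which is flattened here into an ordered array,
and struct commit, naive_good, masked_good, first_bad and the eight-entry
history are names and data invented for illustration. None of this is
git's actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* One candidate in a simplified, linearised bisection space (assumption). */
struct commit {
    bool has_fix;     /* does its history contain the known bugfix? */
    bool resumes_ok;  /* outcome of the actual suspend/resume test */
};

/* Verdict exactly as tested: any failure counts as bad. */
static bool naive_good(const struct commit *c)
{
    return c->resumes_ok;
}

/* The rule from the mail: a commit lacking the fix is marked good anyway. */
static bool masked_good(const struct commit *c)
{
    return !c->has_fix || c->resumes_ok;
}

/*
 * Binary search for the first bad entry. Assumes the last entry is bad
 * and that a single good->bad transition exists, which is precisely the
 * precondition the naive verdict violates below.
 */
static size_t first_bad(const struct commit *v, size_t n,
                        bool (*is_good)(const struct commit *))
{
    size_t lo = 0, hi = n - 1;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (is_good(&v[mid]))
            lo = mid + 1;   /* first bad entry lies after mid */
        else
            hi = mid;       /* mid is bad, so the answer is mid or earlier */
    }
    return hi;
}

/* Hypothetical data: entry 3 lacks the fix, the real regression is entry 6. */
static const struct commit history[8] = {
    { true,  true  },
    { true,  true  },
    { true,  true  },
    { false, false },   /* side branch without the fix: old badness */
    { true,  true  },
    { true,  true  },
    { true,  false },   /* new regression introduced here */
    { true,  false },
};
```

With this data the naive verdict converges on entry 3, the old badness,
while the masked verdict converges on entry 6, the regression actually
being hunted.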
Re: [Fastboot] [PATCH RFC 0/5] hard_smp_processor_id overhaul
On Thu, Mar 01, 2007 at 04:16:13PM +0900, Fernando Luis Vázquez Cao wrote:
> With the advent of kdump, the assumption that the boot CPU when running
> an UP kernel is always the CPU with a hardware ID of 0 (usually referred
> to as BSP on some architectures) does not hold true anymore. The reason
> being that the dump capture kernel boots on the crashed CPU (the CPU
> that invoked crash_kexec).
>
> As a consequence, the hardcoding of hard_smp_processor_id() to 0 on UP
> systems (see "linux/smp.h") is not correct.
>
> This patch-set does the following:
>
> 1- Remove hardcoding of hard_smp_processor_id on UP systems.
>
> 2- Ask the hardware when possible to obtain the hardware processor id on
> i386, x86_64, and ia64, independently of whether CONFIG_SMP is set or
> not.
>
> 3- Move definition of hard_smp_processor_id for the UP case to asm/smp.h
> on alpha, m32r, powerpc, s390, sparc, sparc64, and um architectures. I
> guess that hardware features could be used to implement
> hard_smp_processor_id even in the UP case, but since I am not an expert
> in these architectures I just move the definition.
>
> The patches have been tested on i386, x86_64, and ia64.

Hi Fernando,

These patches seem fine to me.
Tested on ia64 (Tiger2)

Acked: Simon Horman <[EMAIL PROTECTED]>
Re: 2.6.21-rc1: known regressions (part 2)
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Btw, you seem to have re-ordered the commits - the above is not the
> order you did the bisection in. The known-good commit (f3ccb06..) is
> in the middle. [...]

no - i simply picked them by hand, based on looking at gitk output,
because bisection did not appear to find anything useful:

  9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff is first bad commit

And via that method i found a couple of more 'good' points - which
git-bisect never picked up by itself. (and i did 3-4 separate git-bisect
sessions, one of them was a "git-bisect start drivers/acpi/" - which is
the main area of suspicion). I looked at git-bisect visualize more than
once, and i've attached one of the bisection logs below.

i also think i know what happens. Firstly, my testing is reliable, as i
mentioned it in the other mail i frequently re-visited commits to make
sure that none of my bad/good decisions is spurious - but no, the test
results are extremely reproducible: either the laptop resumes properly
after flashing its disk light or it does not.

the problem i think is that i simply took git-bisect's behavior for
granted (i used it many times already) but forgot about a very basic
precondition: git-bisect will find only a /single/ good->bad transition.
If there is a bad->good transition combined with a good->bad transition
then git-bisect will think it's the same 'badness', while it's a /former/
badness that it is honing in on - totally sending the bisection off into
la-la-land.

so as i mentioned it in the first mail: i /know/ that this commit is a
bad->good transition point:

  f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38

/and i only want to test commits that include this commit/ - because i
know that without this commit git-bisect confuses the /other/ breakage
with the new breakage. In the bisection log below, this choice of
git-bisect:

  ee404566f97f9254433399fbbcfa05390c7c55f7

is 'bad' according to testing, but that's 'another' badness - and i
missed it.

Now, having slept on it, the solution is very simple: whenever git-bisect
picks a commit for which the following command comes up empty:

  git-log | grep f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38

then i'll mark it "git-bisect good" - artificially marking the older
badness as a 'good' area. That way git-bisect will find the right
good->bad transition point.

btw., that's why i tried to pick up commits by hand, making sure that
commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 is always included - but
got lost in the maze of the commit graph, and didn't realize that there
is a simple solution. Nevertheless i wanted to dump the information i
already gathered. Those commits were totally out of order, etc. - they
were picked by a poor human who is much worse at walking graphs than
git-bisect ;-)

	Ingo

git-bisect start
# bad: [01363220f5d23ef68276db8974e46a502e43d01d] [PARISC] clocksource: Move update_cr16_clocksource later in boot
git-bisect bad 01363220f5d23ef68276db8974e46a502e43d01d
# good: [f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38] ACPI: Disable wake GPEs only once.
git-bisect good f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38
# bad: [ee404566f97f9254433399fbbcfa05390c7c55f7] sysctl: mips/au1000: remove sys_sysctl support
git-bisect bad ee404566f97f9254433399fbbcfa05390c7c55f7
# bad: [c827ba4cb49a30ce581201fd0ba2be77cde412c7] Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
git-bisect bad c827ba4cb49a30ce581201fd0ba2be77cde412c7
# bad: [68a696a01f482859a9fe937249e8b3d44252b610] Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-tc
git-bisect bad 68a696a01f482859a9fe937249e8b3d44252b610
# bad: [1c433fbda4896a6455d97b66a4f2646cbdd52a8c] [ALSA] soc - 0.13 ASoC headers
git-bisect bad 1c433fbda4896a6455d97b66a4f2646cbdd52a8c
# bad: [048b945077bdc7e8dff5d5810ff2a0ced3590ca9] [ALSA] echoaudio, add TLV support
git-bisect bad 048b945077bdc7e8dff5d5810ff2a0ced3590ca9
# bad: [c07584c83287ae5a13cc836f69a1d824ad068c66] [ALSA] hda-codec - Add support for Medion laptops
git-bisect bad c07584c83287ae5a13cc836f69a1d824ad068c66
# bad: [dbc6b6ad767c86907db373e85139b0e975ba7599] [ALSA] ASoC codecs: generic AC97 support
git-bisect bad dbc6b6ad767c86907db373e85139b0e975ba7599
# bad: [b66b3cfe6c2f6560f351278883a325b6ebc478f5] [ALSA] hda_intel: increase maximum DMA buffer size to 1024MB
git-bisect bad b66b3cfe6c2f6560f351278883a325b6ebc478f5
# bad: [12b131c4cf3eb1dc8a60082a434b7b100774c2e7] [ALSA] allow registering an alsa device with struct device pointer
git-bisect bad 12b131c4cf3eb1dc8a60082a434b7b100774c2e7
# bad: [e4f8e656d8c152c08cd44d0e3c21f009fab09952] [ALSA] usb-audio: allow pausing
git-bisect bad e4f8e656d8c152c08cd44d0e3c21f009fab09952
# bad: [1700f3080d98323e91864d67cb9f6d46f818ccf0] [ALSA] usb-audio: merge playback/capture hardware information structs
git-bisect bad 1700f3080d98323e91864d67cb9f6d46f818ccf0
# bad: [9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff] [ALSA] snd-emu
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, Mar 01, 2007 at 10:51:00PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
>
> > > There was no talk about slightly. 1G page size would actually be quite
> > > convenient for some applications.
> >
> > But it is far from convenient for the kernel. So we have hugepages, so
> > we can stay out of the hair of those applications and they can stay out
> > of ours.
>
> Huge pages cannot do I/O so we would get back to the gazillions of pages
> to be handled for I/O. I'd love to have I/O support for huge pages. This
> would address some of the issues.

Can't direct IO from a hugepage?

> > > Writing a terabyte of memory to disk with handling 256 billion page
> > > structs? In case of a system with 1 petabyte of memory this may be rather
> > > typical and necessary for the application to be able to save its state
> > > on disk.
> >
> > But you will have newer IO controllers, faster CPUs...
>
> Sure we will. And you believe that the newer controllers will be able
> to magically shrink the SG lists somehow? We will offload the
> coalescing of the page structs into bios in hardware or some such thing?
> And the vmscans etc too?

As far as pagecache page management goes, is that an issue for you?
I don't want to know about how many billions of pages for some operation,
just some profiles.

> > Is it a problem or isn't it? Waving around the 256 billion number isn't
> > impressive because it doesn't really say anything.
>
> It is the number of items that needs to be handled by the I/O layer and
> likely by the SG engine.

The number is irrelevant, it is the rate that is important.

> > I understand you have controllers (or maybe it is a block layer limit)
> > that doesn't work well with 4K pages, but works OK with 16K pages.
>
> Really? This is the first that I have heard about it.

Maybe that's the issue you're running into.

> > This is not something that we would introduce variable sized pagecache
> > for, surely.
>
> I am not sure where you get the idea that this is the sole reason why we
> need to be able to handle larger contiguous chunks of memory.

I'm not saying that. You brought up this subject of variable sized pagecache.

> How about coming up with a response to the issue at hand? How do I write
> back 1 Terabyte effectively? Ok this may be an exotic configuration today
> but in one year this may be much more common. Memory sizes keep on
> increasing and so is the number of page structs to be handled for I/O. At
> some point we need a solution here.

Considering you're just handwaving about the actual problems, I don't
know. I assume you're sitting in front of some workload that has gone
wrong, so can't you elaborate?

Eventually, increasing x86 page size a bit might be an idea. We could even
do it in software if CPU manufacturers don't for us.

That doesn't buy us a great deal if you think there is this huge looming
problem with struct page management though.
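
For scale, the page counts argued over in this thread follow from
dividing memory size by page size. The sketch below assumes binary units
(1 TiB = 2^40 bytes) and a 4 KiB base page, neither of which the thread
states explicitly; pages_needed and the unit macros are names invented
here. Under those assumptions 1 PiB in 4 KiB pages comes to 2^38 pages,
which matches the "256 billion" figure quoted above, while 1 TiB comes to
2^28 (about 268 million) pages, and 2 MiB pages cut either count by a
factor of 512.

```c
#include <stdint.h>

/* Binary size units; an assumption, the thread does not say which it means. */
#define KiB (UINT64_C(1) << 10)
#define MiB (UINT64_C(1) << 20)
#define TiB (UINT64_C(1) << 40)
#define PiB (UINT64_C(1) << 50)

/*
 * Number of pages (and hence struct page entries) needed to cover
 * mem_bytes, rounding up. page_bytes must be non-zero.
 */
static uint64_t pages_needed(uint64_t mem_bytes, uint64_t page_bytes)
{
    return mem_bytes / page_bytes + (mem_bytes % page_bytes != 0);
}
```

For example, pages_needed(PiB, 4 * KiB) gives the number of 4 KiB pages in
a petabyte and pages_needed(PiB, 2 * MiB) the count with the 2M pages
Christoph asks for.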
Re: [patch 2.6.20-rc2] gpio_direction_output() needs an initial value
> It's been pointed out that output GPIOs should have an initial value, to
> avoid signal glitching ... among other things, it can be some time before
> a driver is ready. This patch corrects that oversight, fixing
>
> - documentation
> - platforms supporting the GPIO interface
> - users of that call (just one for now, others are pending)
>
> Note that most platforms are clear about the hardware letting the output
> value be set before the pin direction is changed, but the s3c241x docs
> are vague on that topic ... so those chips might not avoid the glitches.
>
> Signed-off-by: David Brownell <[EMAIL PROTECTED]>

Acked-by: Milan Svoboda <[EMAIL PROTECTED]>
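
The glitch the patch description refers to can be illustrated with a toy
model of one pin. This is a user-space sketch, not kernel code and not
any real controller: struct toy_pin and the helper functions are invented
for illustration, and the model simply assumes the pin drives whatever is
in its output latch as soon as its direction becomes "output".

```c
#include <stdbool.h>

/* Toy model of a single GPIO pin. Purely illustrative (assumption). */
struct toy_pin {
    bool is_output;   /* false: input (not driving), true: driving */
    bool latch;       /* level driven once the pin is an output */
};

/* 1 if the pin is currently driving a level other than the wanted one. */
static int drives_wrong_level(const struct toy_pin *p, bool wanted)
{
    return p->is_output && p->latch != wanted;
}

/* Direction first, value second: a stale latch value reaches the wire. */
static int glitches_direction_first(bool stale_latch, bool wanted)
{
    struct toy_pin p = { false, stale_latch };
    int glitches = 0;

    p.is_output = true;                          /* pin starts driving */
    glitches += drives_wrong_level(&p, wanted);
    p.latch = wanted;                            /* value arrives too late */
    glitches += drives_wrong_level(&p, wanted);
    return glitches;
}

/* Value first, direction second: what an initial-value argument permits. */
static int glitches_value_first(bool stale_latch, bool wanted)
{
    struct toy_pin p = { false, stale_latch };
    int glitches = 0;

    p.latch = wanted;                            /* still an input: harmless */
    glitches += drives_wrong_level(&p, wanted);
    p.is_output = true;                          /* drives the right level */
    glitches += drives_wrong_level(&p, wanted);
    return glitches;
}
```

In this model the direction-first ordering glitches whenever the stale
latch content differs from the wanted level, and the value-first ordering
never does. Whether real hardware behaves this way depends on the chip,
as the note about the s3c241x docs above points out.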
Re: [PATCH] libata: Cable detection fixes
On Thu, Mar 01, 2007 at 08:33:17PM -0500, Jeff Garzik wrote:
>
> That little change, buried in the middle of Alan's patch, changes the
> probing order for a /lot/ of devices, possibly millions, when you
> consider that it changes behavior of ata_piix (Intel SATA) as well as
> all the not-yet-default PATA controllers.

Hm, I got recently hands on a hardware where 2.6.21-rc1 based kernels
from Fedora rawhide simply do not boot as there is no way to get to
disks. I would not mind some change in behavior although so far I can
boot at least some earlier kernels.

This looks like ATIIXP issue and details are here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229621

Changelogs for kernels in question have this:

* Wed Feb 21 2007 Dave Jones <[EMAIL PROTECTED]>
- 2.6.21-rc1

   Michal
Re: belkin bulldog ups monitor vs 2.6.21-rc2
On Friday 02 March 2007, Con Kolivas wrote:
>On 02/03/07, Gene Heskett <[EMAIL PROTECTED]> wrote:
>> Greetings;
>>
>> I just rebooted to 2.6.21-rc2 and noted that getting x up and running
>> was about 15 seconds longer than usual. When it got a bash shell
>> going I went to it and ran htop which showed that the bulldog monitor
>> was taking 90% of the cpu. Killed it, then restarted it, but when I
>> ran the gui which ran fine and then stopped the gui, the daemon once
>> again went hog wild and had to be killed, and I'm losing my kmail
>> composer focus for 30 seconds at a time now that amanda is making her
>> nightly run.
>>
>> There is nothing in the log about it other than from xinetd as it ran
>> the amanda server stuff.
>>
>> Not quite ready for prime time methinks. Using the ck scheduler, this
>> is terrible performance, virtually no multitasking. Back to
>> 2.6.20-ck1 in the morning if it lives the rest of the night.
>
>HI Gene.
>
>I'm not sure if you're saying here that the performance is terrible on
>2.6.21-rc2 only with the -ck scheduler, or only 2.6.21-rc2, or that
>2.6.20-ck1 is terrible or that it fixes the problem. Can you please
>clarify this?

I misspoke above now that I read it again, sorry Con. I think I thought
my fingers had put 'Comparing' in front of the 'Using' above. This time
of the night, my mind has been known to be running a chapter or more
ahead of (or in some cases behind) my fingers.

2.6.20-ck1 runs great, 2.6.21-rc2 was not only a dog, it fed amanda a
bunch of lsd via bad data from tar, so tar when told to do a level 1
while 21-rc2 (without your patch) was running, it actually did a level 0,
and predictably ran out of vtape. /usr/pix didn't change over 7GB of its
contents overnight, in fact nothing changed there yesterday, but tar sure
went on a rampage.

Sorry about the confusion. I'm back in 2.6.20-ck1 and everything's cool.

>Regards,
>-ck

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Re: [PATCH] slab: remove colouroff from struct slab
On Thu, 22 Feb 2007 14:37:38 +0200 (EET) Pekka J Enberg <[EMAIL PROTECTED]> wrote:

> As the color offset is always within the first page of the slab,
> virt_to_page() works just fine without slabp->colouroff.

kernel BUG at mm/slab.c:1658!
invalid opcode: [#1]
SMP
last sysfs file: /block/hdc/range
Modules linked in:
CPU:1
EIP:0060:[]Not tainted VLI
EFLAGS: 00010246 (2.6.21-rc2-mm1 #7)
EIP is at kmem_freepages+0xc8/0xd0
eax: 4000 ebx: c106e730 ecx: edx:
esi: 0001 edi: c21fcbe0 ebp: c2231e9c esp: c2231e8c
ds: 007b es: 007b fs: 00d8 gs: ss: 0068
Process swapper (pid: 0, ti=c223 task=c222cac0 task.ti=c223)
Stack: c21fcbe0 c21fcbe0 f6252020 0002 c2231eac c017409c f62b7020 c1b74f80 c2231ec0 c013078a c1b74ffc c05502c0 c2231ec8 c0130901 c2231ee0 c012495a c05519d0 0003 c04faf68 c0551a20 c2231efc c01242d7 000a
Call Trace:
 [] show_trace_log_lvl+0x1a/0x30
 [] show_stack_log_lvl+0xa9/0xd0
 [] show_registers+0x1e9/0x2f0
 [] die+0x11a/0x250
 [] do_trap+0x91/0xc0
 [] do_invalid_op+0x97/0xb0
 [] error_code+0x7c/0x84
 [] kmem_rcu_free+0x1c/0x50
 [] __rcu_process_callbacks+0x6a/0x1c0
 [] rcu_process_callbacks+0x21/0x50
 [] tasklet_action+0x5a/0xe0
 [] __do_softirq+0x87/0x100
 [] do_softirq+0x57/0x60
 [] irq_exit+0x47/0x50
 [] smp_apic_timer_interrupt+0x55/0x90
 [] apic_timer_interrupt+0x33/0x38
 [] cpu_idle+0x7f/0xe0
 [] start_secondary+0x281/0x3c0
 [<>] 0x0
 ===
Code: fe ff 58 5b 5e 5f 5d c3 8b 03 89 f1 ba 09 00 00 00 f7 d9 c1 e8 1e 8d 04 40 8d 04 c0 c1 e0 05 05 40 f1 4c c0 e8 9a dc fe ff eb 95 <0f> 0b eb fe 8d 74 26 00 55 b9 6c dd 79 c0 89 e5 ba 52 fc 46 c0
EIP: [] kmem_freepages+0xc8/0xd0 SS:ESP 0068:c2231e8c

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc2-mm1
# Thu Mar 1 23:05:37 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_UTS_NS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
# CONFIG_CPUSETS is not set
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
# CONFIG_TICK_ONESHOT is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
CONFIG_MPENTIUMIII=y
# CONFIG_MPENTIUMM is not set
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
CONFIG_MWI
Re: [RFC] Heads up on sys_fallocate()
Andrew Morton wrote:
> Perhaps Ulrich can comment.

I was out of town, hence the delay.

I think that if there is no support for the syscall the correct answer
is to return ENOSYS. In this case the current userlevel code would be
used and ENOSYS is also used to trigger the use of the compat code in
glibc in case the syscall does not exist at all.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
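
The fallback rule Ulrich describes can be sketched as follows. This is a
minimal illustration, not glibc's code: reserve_fn, reserve_space and the
three stub functions are hypothetical names, and function pointers stand
in for the real syscall and the userlevel emulation so that the sketch
stays self-contained. The only behaviour taken from the message above is
that ENOSYS, and nothing else, selects the userlevel path.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical shape of a "reserve space" operation: returns 0 on
 * success, -1 with errno set on failure.
 */
typedef int (*reserve_fn)(int fd, long long offset, long long len);

/*
 * Try the kernel path first. Fall back to the emulation only when the
 * kernel path is absent or reports ENOSYS; any other error is a real
 * failure and is passed straight back to the caller.
 */
static int reserve_space(reserve_fn kernel_path, reserve_fn emulation,
                         int fd, long long offset, long long len)
{
    if (kernel_path != NULL) {
        if (kernel_path(fd, offset, len) == 0)
            return 0;
        if (errno != ENOSYS)
            return -1;
    }
    return emulation(fd, offset, len);
}

/* Stand-ins for demonstration only (assumptions, not real interfaces). */
static int stub_enosys(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    errno = ENOSYS;     /* "syscall or filesystem support missing" */
    return -1;
}

static int stub_enospc(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    errno = ENOSPC;     /* a genuine failure that must not be masked */
    return -1;
}

static int stub_ok(int fd, long long offset, long long len)
{
    (void)fd; (void)offset; (void)len;
    return 0;
}
```

Keeping ENOSYS as the single trigger means a disk-full or bad-descriptor
error from the kernel is never papered over by a slower emulation that
would fail in the same way.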
Re: belkin bulldog ups monitor vs 2.6.21-rc2
On Friday 02 March 2007, Gene Heskett wrote:
>Greetings;
>
>I just rebooted to 2.6.21-rc2 and noted that getting x up and running
> was about 15 seconds longer than usual. When it got a bash shell going
> I went to it and ran htop which showed that the bulldog monitor was
> taking 90% of the cpu. Killed it, then restarted it, but when I ran
> the gui which ran fine and then stopped the gui, the daemon once again
> went hog wild and had to be killed, and I'm losing my kmail composer
> focus for 30 seconds at a time now that amanda is making her nightly
> run.
>
>There is nothing in the log about it other than from xinetd as it ran
> the amanda server stuff.
>
>Not quite ready for prime time methinks. Using the ck scheduler, this
> is terrible performance, virtually no multitasking. Back to 2.6.20-ck1
> in the morning if it lives the rest of the night.

Addendum, amanda finished early, it seems tar thought every level was a
level 0, so it ran out of storage after only 3 dle's were processed and
backed up. There are about 25 dle's. It tried to put 11GB on an 8GB
vtape, which because it was a vtape, it could do.

So it appears something in the ext3 filesystem is sadly misinforming tar
when it does the estimate scan vs doing the real file reading. Or the
scan is updating the ctime?

I'm back on 2.6.20-ck1 & everything is copacetic again. I'll find out if
the filesystem is damaged tomorrow night cause if the ctimes are all
screwed up, amanda will effectively be starting from scratch. That is not
exactly a Good Thing(TM).

I did find the ls -lt command, and the filesystem looks ok timewise when
rebooted now. I have no more ready clues without your able questions to
guide me on this.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Re: Kernel Null pointer dereference in sysfs_readdir()
On Thu, Mar 01, 2007 at 05:54:01PM -0800, Kunal Trivedi wrote:
> 5) OOPS messages from console.
><1>Unable to handle kernel NULL pointer dereference at virtual
> address 0018
><1> printing eip:
><4>e01a40c9
><1>*pde =
><1>Oops: [#1]
><4>SMP
><4>Modules linked in: ipt_state ip_conntrack iptable_filter
> cls_u32 iptable_mangle lm85 i2c_i801 w83627hf_wdt w83627hf i2c_sensor
> i2c_isa i2c_core slcmi ip_tables e7xxx_edac edac_mc
><4>CPU:2
><4>EIP:0060:[]Tainted: PF VLI
><4>EFLAGS: 00010286 (2.6.9-34.EL-i386_SMP)
><4>EIP is at sysfs_readdir+0xd9/0x210
><4>eax: ebx: f7d6b104 ecx: 0006 edx: 0020
><4>esi: f7d6b100 edi: f7f1cb87 ebp: f7f1cb80 esp: ef432f48
><4>ds: 007b es: 007b ss: 0068
><4>Process sensors (pid: 2933, threadinfo=ef432000 task=f562c030)
><4>Stack: 0002 016c32f7 000a f7d6cc8c 0006
> f7ddbbc4 e017a670
><4> ef432fa0 ed6e7280 e0409ba0 ed6e7280 f6f180b0 f6f18120
> e017a33f ef432fa0
><4> e017a670 09ce61b4 ed6e7280 fff7 e017a81e
> 09ce6204 09ce61e4
><4>Call Trace:
><4> [] filldir64+0x0/0x140
><4> [] vfs_readdir+0xaf/0xd0
><4> [] filldir64+0x0/0x140
><4> [] sys_getdents64+0x6e/0xb6
><4> [] syscall_call+0x7/0xb
><4>Code: 26 00 89 f0 e8 89 e8 ff ff 89 c5 b9 ff ff ff ff 31 c0 89
> ef f2 ae f7 d1 49 89 4c 24 14 8b 46 20 85 c0 0f 84 22 01 00 00 8b 40
> 10 <8b> 50 18 0f b7 46 1c 89 54 24 08 8b 4c 24 24 c1 e8 0c 89 44 24
>
> Please advise.

I suggest contacting the vendor providing the support for this old kernel version, they should be able to help you out (although they might ask you to not run a closed source driver in your kernel, as that probably voids any support contract you might have.)

thanks,

greg k-h
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007 22:51:00 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> I'd love to have I/O support for huge pages.

direct-IO works.
Re: belkin bulldog ups monitor vs 2.6.21-rc2
On 02/03/07, Gene Heskett <[EMAIL PROTECTED]> wrote:
> Greetings;
>
> I just rebooted to 2.6.21-rc2 and noted that getting x up and running was about 15 seconds longer than usual. When it got a bash shell going I went to it and ran htop which showed that the bulldog monitor was taking 90% of the cpu. Killed it, then restarted it, but when I ran the gui which ran fine and then stopped the gui, the daemon once again went hog wild and had to be killed, and I'm losing my kmail composer focus for 30 seconds at a time now that amanda is making her nightly run.
>
> There is nothing in the log about it other than from xinetd as it ran the amanda server stuff.
>
> Not quite ready for prime time methinks. Using the ck scheduler, this is terrible performance, virtually no multitasking. Back to 2.6.20-ck1 in the morning if it lives the rest of the night.

Hi Gene.

I'm not sure if you're saying here that the performance is terrible on 2.6.21-rc2 only with the -ck scheduler, or only 2.6.21-rc2, or that 2.6.20-ck1 is terrible or that it fixes the problem. Can you please clarify this?

Regards,
-ck
Re: [PATCH] scatterlist.h needs types.h
Hi Andrew,

On Thu, 1 Mar 2007 16:11:06 -0800, Andrew Morton wrote:
> On Thu, 1 Mar 2007 13:55:16 +0100
> Jean Delvare <[EMAIL PROTECTED]> wrote:
>
> > Most architectures' scatterlist.h use the type dma_addr_t, but omit
> > to include which defines it. This could lead to build
> > failures, so let's add the missing includes.
>
> _does_ it actually lead to build errors? If so, 2.6.21. If not, 2.6.22.

No known build error at the moment, so 2.6.22 is fine with me. I'm working on a patch cleaning up the inclusion of across the whole kernel, and this is how I've hit the problem. I'll post that patch later today for comments.

Thanks,
--
Jean Delvare
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote:
> > There was no talk about slightly. 1G page size would actually be quite
> > convenient for some applications.
>
> But it is far from convenient for the kernel. So we have hugepages, so
> we can stay out of the hair of those applications and they can stay out
> of ours.

Huge pages cannot do I/O so we would get back to the gazillions of pages to be handled for I/O. I'd love to have I/O support for huge pages. This would address some of the issues.

> > Writing a terabyte of memory to disk with handling 256 billion page
> > structs? In case of a system with 1 petabyte of memory this may be rather
> > typical and necessary for the application to be able to save its state
> > on disk.
>
> But you will have newer IO controllers, faster CPUs...

Sure we will. And you believe that the newer controllers will be able to magically shrink the SG lists somehow? We will offload the coalescing of the page structs into bios in hardware or some such thing? And the vmscans etc too?

> Is it a problem or isn't it? Waving around the 256 billion number isn't
> impressive because it doesn't really say anything.

It is the number of items that needs to be handled by the I/O layer and likely by the SG engine.

> I understand you have controllers (or maybe it is a block layer limit)
> that doesn't work well with 4K pages, but works OK with 16K pages.

Really? This is the first that I have heard about it.

> This is not something that we would introduce variable sized pagecache
> for, surely.

I am not sure where you get the idea that this is the sole reason why we need to be able to handle larger contiguous chunks of memory.

How about coming up with a response to the issue at hand? How do I write back 1 Terabyte effectively? Ok this may be an exotic configuration today but in one year this may be much more common. Memory sizes keep on increasing and so is the number of page structs to be handled for I/O. At some point we need a solution here.
Re: [PATCH 4/9] Vmi fix highpte
Zachary Amsden wrote:
> Yeah, actually that does work, since you pass the km_type, we can use
> that. But I would rather not respin this for 2.6.21; getting this
> 100% right can be tricky, and we've already done a good deal of
> testing on this patch the way it is.

It seems fairly low risk to me; it's basically the same structure with the same calls happening in the same order, but just slightly rearranged in the source. Of course, if I'd seen this patch earlier I could have given you earlier feedback...

> Do you have any objection to me creating a patch for -mm tree that
> implements kmap_atomic_pte the way you have described above and
> attaching it to the Xen patch series, but leaving the current patch as
> is for now?

Not particularly, but it seems odd to put something in knowing it's going to be immediately replaced. What's the urgency?

> Thanks, (and thanks for the suggestion - I was a little worried about
> how it would play with Xen when HIGHPTE support came around, but it
> looks like it will work for both of us with just one paravirt-op).

Yeah, the kpte_clear_flush change helped as well. I have a patch to make that into a pvop as well, since it's useful to do the clear+flush in a single call.

J
Re: [PATCH] md: Fix for raid6 reshape.
On Thursday March 1, [EMAIL PROTECTED] wrote:
> On Fri, 2 Mar 2007 15:56:55 +1100 NeilBrown <[EMAIL PROTECTED]> wrote:
>
> > - conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
> > + conf->expand_progress = (sector_nr + i) * new_data_disks);
>
> ahem.

It wasn't like that when I tested it, honest... But the original got caught up with some other changes which were not really related so I removed them all and just made this change manually and totally messed it up (again). Sorry.

Of course it should be

> > + conf->expand_progress = (sector_nr + i) * new_data_disks;

NeilBrown
Re: [PATCH] md: Fix for raid6 reshape.
On Fri, 2 Mar 2007 15:56:55 +1100 NeilBrown <[EMAIL PROTECTED]> wrote:
> - conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
> + conf->expand_progress = (sector_nr + i) * new_data_disks);

ahem.
Re: [PATCH 4/9] Vmi fix highpte
Jeremy Fitzhardinge wrote:

Jeremy Fitzhardinge wrote:

Hm, I don't think this interface will work for Xen. In Xen, whenever a pagetable page gets mapped, it must be mapped RO. map_pt_hook gets called after the mapping has already been created, so it's too late for Xen.

I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(), which would be wired through to paravirt_ops to allow Xen to make this a RO mapping. Would this be sufficient for you to do your vmi thing?

Something like this (compiled, untested).

J

diff -r 972e84c265cf arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c	Thu Mar 01 19:12:49 2007 -0800
+++ b/arch/i386/kernel/paravirt.c	Thu Mar 01 19:38:42 2007 -0800
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 /* nop stub */
 void _paravirt_nop(void)
@@ -605,6 +606,8 @@ struct paravirt_ops paravirt_ops = {
 	.kpte_clear_flush = native_kpte_clear_flush,
+	.kmap_atomic_pte = native_kmap_atomic_pte,
+
 #ifdef CONFIG_X86_PAE
 	.set_pte_atomic = native_set_pte_atomic,
 	.set_pte_present = native_set_pte_present,
diff -r 972e84c265cf arch/i386/mm/highmem.c
--- a/arch/i386/mm/highmem.c	Thu Mar 01 19:12:49 2007 -0800
+++ b/arch/i386/mm/highmem.c	Thu Mar 01 19:38:42 2007 -0800
@@ -26,7 +26,7 @@ void kunmap(struct page *page)
  * However when holding an atomic kmap is is not legal to sleep, so atomic
  * kmaps are appropriate for short, tight code paths only.
  */
-void *kmap_atomic(struct page *page, enum km_type type)
+void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot)
 {
 	enum fixed_addresses idx;
 	unsigned long vaddr;
@@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
 		return page_address(page);
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+	set_pte(kmap_pte-idx, mk_pte(page, prot));
 	return (void*) vaddr;
+}
+
+void *kmap_atomic(struct page *page, enum km_type type)
+{
+	return _kmap_atomic(page, type, kmap_prot);
 }

Yeah, actually that does work, since you pass the km_type, we can use that. But I would rather not respin this for 2.6.21; getting this 100% right can be tricky, and we've already done a good deal of testing on this patch the way it is.

Do you have any objection to me creating a patch for -mm tree that implements kmap_atomic_pte the way you have described above and attaching it to the Xen patch series, but leaving the current patch as is for now?

Thanks, (and thanks for the suggestion - I was a little worried about how it would play with Xen when HIGHPTE support came around, but it looks like it will work for both of us with just one paravirt-op).

Zach
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, Mar 01, 2007 at 10:19:48PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
>
> > > From the I/O controller and from the application.
> >
> > Why doesn't the application need to deal with TLB entries?
>
> Because it may only operate on a small section of the file and hopefully
> splice the rest through? But yes support for mmapped I/O would be
> necessary.

So you're talking about copying a file from one location to another?

> > > This would only be a temporary fix pushing the limits to the double or so?
> >
> > And using slightly larger page sizes isn't?
>
> There was no talk about slightly. 1G page size would actually be quite
> convenient for some applications.

But it is far from convenient for the kernel. So we have hugepages, so we can stay out of the hair of those applications and they can stay out of ours.

> > > Amortized? The controller still would have to hunt down the 4kb page
> > > pieces that we have to feed him right now. Result: Huge scatter gather
> > > lists that may themselves create issues with higher page order.
> >
> > What sort of numbers do you have for these controllers that aren't
> > very good at doing sg?
>
> Writing a terabyte of memory to disk with handling 256 billion page
> structs? In case of a system with 1 petabyte of memory this may be rather
> typical and necessary for the application to be able to save its state
> on disk.

But you will have newer IO controllers, faster CPUs...

Is it a problem or isn't it? Waving around the 256 billion number isn't impressive because it doesn't really say anything.

> > Isn't the issue was something like your IO controllers have only a
> > limited number of sg entries, which is fine with 16K pages, but with
> > 4K pages that doesn't give enough data to cover your RAID stripe?
> >
> > We're never going to do a variable sized pagecache just because of that.
>
> No, we need support for larger page sizes than 16k. 16k has not been fine
> for a couple of years. We only agreed to 16k because that was the common
> consensus. Best performance was always at 64k 4 years ago (but then we
> have no numbers for higher page sizes yet). Now we would prefer much
> larger sizes.

But you are in a tiny minority, so it is not so much a question of what you prefer, but what you can make do with without being too intrusive.

I understand you have controllers (or maybe it is a block layer limit) that doesn't work well with 4K pages, but works OK with 16K pages. This is not something that we would introduce variable sized pagecache for, surely.
Re: [PATCH 4/9] Vmi fix highpte
Zachary Amsden wrote:
> That doesn't quite work, since we need to know which of the two -
> KM_PTE0 or KM_PTE1 is being mapped. But it could be moved to before
> the mapping, as you need, and take this as a parameter.

Err, kmap_atomic_pte gets passed the type - KM_PTE0 or KM_PTE1.

J
Re: [PATCH 4/9] Vmi fix highpte
Jeremy Fitzhardinge wrote:

Jeremy Fitzhardinge wrote:

Hm, I don't think this interface will work for Xen. In Xen, whenever a pagetable page gets mapped, it must be mapped RO. map_pt_hook gets called after the mapping has already been created, so it's too late for Xen.

I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(), which would be wired through to paravirt_ops to allow Xen to make this a RO mapping. Would this be sufficient for you to do your vmi thing?

Something like this (compiled, untested).

J

diff -r 972e84c265cf arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c	Thu Mar 01 19:12:49 2007 -0800
+++ b/arch/i386/kernel/paravirt.c	Thu Mar 01 19:38:42 2007 -0800
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 /* nop stub */
 void _paravirt_nop(void)
@@ -605,6 +606,8 @@ struct paravirt_ops paravirt_ops = {
 	.kpte_clear_flush = native_kpte_clear_flush,
+	.kmap_atomic_pte = native_kmap_atomic_pte,
+

That doesn't quite work, since we need to know which of the two - KM_PTE0 or KM_PTE1 is being mapped. But it could be moved to before the mapping, as you need, and take this as a parameter.

Zach
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote:
> > From the I/O controller and from the application.
>
> Why doesn't the application need to deal with TLB entries?

Because it may only operate on a small section of the file and hopefully splice the rest through? But yes support for mmapped I/O would be necessary.

> > This would only be a temporary fix pushing the limits to the double or so?
>
> And using slightly larger page sizes isn't?

There was no talk about slightly. 1G page size would actually be quite convenient for some applications.

> > Amortized? The controller still would have to hunt down the 4kb page
> > pieces that we have to feed him right now. Result: Huge scatter gather
> > lists that may themselves create issues with higher page order.
>
> What sort of numbers do you have for these controllers that aren't
> very good at doing sg?

Writing a terabyte of memory to disk with handling 256 billion page structs? In case of a system with 1 petabyte of memory this may be rather typical and necessary for the application to be able to save its state on disk.

> Isn't the issue was something like your IO controllers have only a
> limited number of sg entries, which is fine with 16K pages, but with
> 4K pages that doesn't give enough data to cover your RAID stripe?
>
> We're never going to do a variable sized pagecache just because of that.

No, we need support for larger page sizes than 16k. 16k has not been fine for a couple of years. We only agreed to 16k because that was the common consensus. Best performance was always at 64k 4 years ago (but then we have no numbers for higher page sizes yet). Now we would prefer much larger sizes.
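The page-struct counts traded back and forth in this thread are easier to weigh with the arithmetic written out. What follows is only a sketch, assuming binary units (1 TiB = 2^40 bytes, 1 PiB = 2^50 bytes) and one page struct per base page; `pages_needed` and the size macros are hypothetical names for illustration, not kernel symbols.

```c
#include <stdint.h>

/* Illustrative size constants (binary units); not taken from the kernel. */
#define KIB (1ULL << 10)
#define TIB (1ULL << 40)
#define PIB (1ULL << 50)

/* Hypothetical helper: number of base pages (and therefore page structs)
 * needed to cover mem_bytes of memory with pages of page_bytes each.
 * Rounds up so that a partial trailing page still counts as one page. */
static uint64_t pages_needed(uint64_t mem_bytes, uint64_t page_bytes)
{
	return (mem_bytes + page_bytes - 1) / page_bytes;
}
```

Under these assumptions, 1 TiB in 4 KiB pages comes to 2^28 pages (roughly 268 million), while 2^38 pages (the "256 billion" order of magnitude) corresponds to 1 PiB in 4 KiB pages. Moving to 16 KiB pages divides either count by four.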
Re: [PATCH] needs to include
On Sat, 24 Feb 2007 12:22:11 + Ralf Baechle <[EMAIL PROTECTED]> wrote:
> sysdev.h uses THIS_MODULE so should include .
>
> Signed-off-by: Ralf Baechle <[EMAIL PROTECTED]>
>
> diff --git a/include/linux/sysdev.h b/include/linux/sysdev.h
> index 389ccf8..e699ab2 100644
> --- a/include/linux/sysdev.h
> +++ b/include/linux/sysdev.h
> @@ -22,6 +22,7 @@
> #define _SYSDEV_H_
>
> #include
> +#include
> #include
>

You can't just make changes like this without a lot of compile testing, I'm afraid.

This causes a recursive inclusion and sched.h blows up:

In file included from include/linux/utsname.h:35,
                 from include/asm/elf.h:12,
                 from include/linux/elf.h:7,
                 from include/linux/module.h:15,
                 from include/linux/sysdev.h:25,
                 from kernel/time/clocksource.c:28:
include/linux/sched.h:1648: warning: 'struct sysdev_class' declared inside parameter list
include/linux/sched.h:1648: warning: its scope is only this definition or declaration, which is probably not what you want

I think we can fix that by moving the declarations into cpu.h and getting that unpleasant include out of sched.h.

Of course, this will probably make other things blow up and additional sysdev.h includes will now be needed. We'll see..

diff -puN include/linux/cpu.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix include/linux/cpu.h
--- a/include/linux/cpu.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix
+++ a/include/linux/cpu.h
@@ -41,6 +41,9 @@ extern void cpu_remove_sysdev_attr(struc
 extern int cpu_add_sysdev_attr_group(struct attribute_group *attrs);
 extern void cpu_remove_sysdev_attr_group(struct attribute_group *attrs);
+extern struct sysdev_attribute attr_sched_mc_power_savings;
+extern struct sysdev_attribute attr_sched_smt_power_savings;
+extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);
 #ifdef CONFIG_HOTPLUG_CPU
 extern void unregister_cpu(struct cpu *cpu);
diff -puN include/linux/sched.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix include/linux/sched.h
--- a/include/linux/sched.h~linux-sysdevh-needs-to-include-linux-moduleh-up-fix
+++ a/include/linux/sched.h
@@ -1642,10 +1642,7 @@ static inline void arch_pick_mmap_layout
 extern long sched_setaffinity(pid_t pid, cpumask_t new_mask);
 extern long sched_getaffinity(pid_t pid, cpumask_t *mask);
-#include
 extern int sched_mc_power_savings, sched_smt_power_savings;
-extern struct sysdev_attribute attr_sched_mc_power_savings, attr_sched_smt_power_savings;
-extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);
 extern void normalize_rt_tasks(void);
_
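For readers unfamiliar with the "declared inside parameter list" warning in this message: it appears when a prototype names a struct tag the compiler has not yet seen at file scope, so the tag ends up scoped to that single prototype. The sketch below shows one common remedy, a forward declaration placed before the prototype. Note that this is a different technique from the patch above, which moves the declarations into cpu.h; all names here are stand-ins invented for illustration, not the kernel's definitions.

```c
/* Forward declaration: makes the tag visible at file scope without
 * pulling in whichever header defines the struct's members. */
struct sketch_class;

/* Safe now: the parameter type refers to the file-scope tag above,
 * so the compiler no longer warns about a prototype-local struct. */
int sketch_create_entries(struct sketch_class *cls);

/* The full definition can follow later, from any header or source file. */
struct sketch_class {
	int registered;
};

/* Stand-in implementation: marks the class as registered.
 * Returns 0 on success, -1 if cls is a null pointer. */
int sketch_create_entries(struct sketch_class *cls)
{
	if (!cls)
		return -1;
	cls->registered = 1;
	return 0;
}
```

A forward declaration is enough whenever only pointers to the struct are used, which is why it can often break an include cycle without moving code between headers.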
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, Mar 02, 2007 at 02:50:29PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Mar 2007 21:11:58 -0800 (PST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> > The whole DRAM power story is a bedtime story for gullible children. Don't
> > fall for it. It's not realistic. The hardware support for it DOES NOT
> > EXIST today, and probably won't for several years. And the real fix is
> > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which
> > is against the whole point of FBDIMM in the first place, but that's what
> > you get when you ignore power in the first version!).
> >
>
> Note:
> I heard embedded people often design their own memory-power-off control on
> embedded Linux. (but it never seems to be posted to the list.) But I don't know
> they are interested in generic memory hotremove or not.
>

Yes, this is not that uncommon of a thing. People tend to do this in a couple of different ways, in some cases the system is too loaded to ever make doing such a thing at run-time worthwhile, and in those cases these sorts of things tend to be munged in with the suspend code.

Unfortunately it tends to be quite difficult in practice to keep pages in one place, so people rely on lame chip-select hacks and limiting the amount of memory that the kernel treats as RAM instead so it never ends up being an issue. Having some sort of a balance would certainly be nice, though.
Re: [RFC] Heads up on sys_fallocate()
On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> Just curious .. What does posix_fallocate() return ?

bookmark this: http://www.opengroup.org/onlinepubs/009695399/nfindex.html

Upon successful completion, posix_fallocate() shall return zero; otherwise, an error number shall be returned to indicate the error.
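The return convention quoted above differs from most system calls: the error number comes back as the return value rather than through errno. A minimal userspace sketch of a caller that handles this, assuming a POSIX system; `preallocate` is a hypothetical wrapper name, while `posix_fallocate()` itself is the standard function.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical wrapper: ask the system to guarantee storage for len bytes
 * starting at offset in the open file fd.
 * Returns 0 on success or a positive error number on failure.
 * posix_fallocate() reports failure through its return value, so the
 * result is passed to strerror() directly instead of reading errno. */
static int preallocate(int fd, off_t offset, off_t len)
{
	int err = posix_fallocate(fd, offset, len);

	if (err != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
	return err;
}
```

Calling `preallocate(fd, 0, 1 << 20)` on a writable regular file would request the first megabyte; a non-zero result is the error number itself.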
Re: [BUG 2.6.21-rc2] divide error: 0000
On Thu, Mar 01, 2007 at 11:12:42PM +, Sean Young wrote:
> Apologies if this has already been reported.
>
> If I call clock_gettime(CLOCK_THREAD_CPUTIME_ID, .. ) twice I get:
>
> divide error: [#1]
> Modules linked in: binfmt_misc rfcomm l2cap bluetooth sonypi speedstep_ich
> speedstep_lib cpufreq_userspace cpufreq_stats cpufreq_powersave
> cpufreq_ondemand freq_table cpufreq_conservative video thermal sbs processor
> i2c_ec fan dock button battery ac af_packet ipv6 sbp2 lp usb_storage libusual
> orinoco_cs orinoco hermes joydev tsdev usbhid pcmcia e100 mii psmouse
> ohci1394 serio_raw yenta_socket rsrc_nonstatic pcmcia_core ieee1394 sr_mod
> cdrom sg uhci_hcd parport_pc parport pcspkr evdev usbcore
> CPU:0
> EIP:0060:[]Not tainted VLI
> EFLAGS: 00010246 (2.6.21-rc2 #1)
> EIP is at sample_to_timespec+0x28/0x33
> eax: 63b5a669 ebx: fffa ecx: 63b5a669 edx: fffa
> esi: d4a56fa4 edi: 3b9aca00 ebp: d4a56fa4 esp: d4a56f74
> ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
> Process x (pid: 3894, ti=d4a56000 task=dfe9aa50 task.ti=d4a56000)
> Stack: fffe c0127d49 d4a56fa4 63b5a669 fffa fffe
> 0003 d4a56000 c0125bf3 b7f68ff4 b7f9fce0 fffe 0003
> c0103bfc fffe bfd6d5d8 b7f74ff4 0003 bfd6d5b8 0109
> Call Trace:
> [] posix_cpu_clock_get+0x47/0xdc
> [] sys_clock_gettime+0x80/0x82
> [] syscall_call+0x7/0xb
> [] svc_ioctl+0xc2/0x261
> ===
> Code: 0b eb fe 57 56 53 89 cb 89 d1 8b 74 24 10 83 e0 03 83 f8 02 74 0c 89 f2
> 89 c8 5b 5e 5f e9 ee 3f ff ff bf 00 ca 9a 3b 89 d0 89 da f7 89 56 04 89
> 06 5b 5e 5f c3 55 57 56 53 89 c7 89 d6 89 cb
> EIP: [] sample_to_timespec+0x28/0x33 SS:ESP 0068:d4a56f74
>
> The instruction is:
>
> div %edi
>
> And edi is 1e9 (0x3b9aca00). I don't understand why this results in a
> divide error.

It does this because 'div' does an unsigned divide of edx:eax by edi. Here, edx=fffa and eax is 63b5a669. Clearly, such a number cannot be divided by 1e9 to return a 32-bit value.

Given the values we see here, I suspect the code should have used an integer divide (idiv). This means that something in the code implies that the result is unsigned while it should be signed.

Regards,
Willy
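The overflow rule behind this explanation can be stated compactly: the 32-bit `div` instruction divides the 64-bit value edx:eax by a 32-bit operand and raises a divide error whenever the quotient does not fit in 32 bits, which happens exactly when edx is greater than or equal to the divisor. The sketch below models that condition in portable C; the function names are hypothetical, and it does not reproduce the kernel's `sample_to_timespec()` code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the fault condition of the 32-bit x86 unsigned `div` instruction,
 * which divides the 64-bit value hi:lo (edx:eax) by a 32-bit divisor.
 * A divide error is raised when the divisor is zero or when the quotient
 * would need more than 32 bits, which is exactly when hi >= divisor. */
static bool div32_would_fault(uint32_t hi, uint32_t lo, uint32_t divisor)
{
	(void)lo; /* the low half cannot change whether the quotient overflows */
	return divisor == 0 || hi >= divisor;
}

/* Signed counterpart done as ordinary C arithmetic: a small negative
 * dividend produces a small negative quotient instead of being read as
 * a huge unsigned number. */
static int64_t signed_div(int64_t dividend, int32_t divisor)
{
	return dividend / divisor;
}
```

With a divisor of 1e9 (0x3b9aca00), any high word that has its top bits set, as a sign-extended negative value would, is far above the divisor, so the unsigned divide faults; the same bit pattern interpreted as a signed 64-bit number divides without trouble.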
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, Mar 01, 2007 at 09:53:42PM -0800, Christoph Lameter wrote:
> On Fri, 2 Mar 2007, Nick Piggin wrote:
>
> > > You do not have to deal with TLB entries if you do buffered I/O.
> >
> > Where does the data come from?
>
> From the I/O controller and from the application.

Why doesn't the application need to deal with TLB entries?

> > > We currently have problems with the kernel limits of 128 SG
> > > entries but the fundamental issue is that we can only do 2 Meg of I/O in
> > > one go given the default limits of the block layer. Typically the number
> > > of hardware SG entries is also limited. We never will be able to put a
> >
> > Seems like changing the default limits would be the easiest way to
> > fix it then?
>
> This would only be a temporary fix pushing the limits to the double or so?

And using slightly larger page sizes isn't?

> > As far as hardware limits go, I don't think you need to scale that
> > number linearly with the amount of memory you have, or even with the
> > IO throughput. You should reach a point where your command overhead
> > is amortised sufficiently, and the controller will be pipelining the
> > commands.
>
> Amortized? The controller still would have to hunt down the 4kb page
> pieces that we have to feed him right now. Result: Huge scatter gather
> lists that may themselves create issues with higher page order.

What sort of numbers do you have for these controllers that aren't very good at doing sg?

Isn't the issue was something like your IO controllers have only a limited number of sg entries, which is fine with 16K pages, but with 4K pages that doesn't give enough data to cover your RAID stripe?

We're never going to do a variable sized pagecache just because of that.
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation "fallocate", for persistent preallocation. The new system call, as Andrew suggested, will look like:
>
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

I am wondering about return values from this syscall? Is it supposed to return the number of bytes allocated? What about partial allocations? What if the blocks already exist? What would be the return values in those cases?

Just curious .. What does posix_fallocate() return ?

Thanks,
Badari
Re: [patch 2.6.20-rc2] gpio_direction_output() needs an initial value
hi David,

> It's been pointed out that output GPIOs should have an initial value, to
> avoid signal glitching ... among other things, it can be some time before
> a driver is ready. This patch corrects that oversight, fixing

For the AT91 changes:
Acked-by: Andrew Victor <[EMAIL PROTECTED]>

> --- g26.orig/drivers/spi/atmel_spi.c 2007-02-28 12:47:43.0 -0800
> +++ g26/drivers/spi/atmel_spi.c 2007-03-01 15:29:30.0 -0800
> - gpio_direction_output(npcs_pin);
> + gpio_direction_output(npcs_pin, !(spi->mode & SPI_CS_HIGH));
> }

As mentioned previously (by Walter Tuppa), wouldn't it be better to just change this to:
cs_deactivate(spi);

Regards,
Andrew Victor
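The expression `!(spi->mode & SPI_CS_HIGH)` in the quoted hunk picks the level at which the chip select is inactive, so the pin never glitches into selecting the device when it is switched to output. A userspace sketch of that computation follows; `SKETCH_SPI_CS_HIGH` is an assumed stand-in for the kernel's `SPI_CS_HIGH` mode bit, redefined here only so the example builds outside the kernel, and `cs_inactive_level` is a hypothetical helper name.

```c
#include <stdbool.h>

/* Assumption: mirrors the SPI_CS_HIGH mode bit from the kernel's
 * <linux/spi/spi.h>; redefined so this sketch compiles in userspace. */
#define SKETCH_SPI_CS_HIGH 0x04

/* Level to drive on the chip-select GPIO while the device is NOT selected:
 * high for the usual active-low chip select, low when the device uses an
 * active-high chip select (the CS_HIGH bit is set in mode). */
static bool cs_inactive_level(unsigned int mode)
{
	return !(mode & SKETCH_SPI_CS_HIGH);
}
```

Passing this value as the initial output level is what keeps an active-low device deselected from the moment the pin becomes an output.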
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote:
> > You do not have to deal with TLB entries if you do buffered I/O.
>
> Where does the data come from?

From the I/O controller and from the application.

> > We currently have problems with the kernel limits of 128 SG
> > entries but the fundamental issue is that we can only do 2 Meg of I/O in
> > one go given the default limits of the block layer. Typically the number
> > of hardware SG entries is also limited. We never will be able to put a
>
> Seems like changing the default limits would be the easiest way to
> fix it then?

This would only be a temporary fix pushing the limits to the double or so?

> As far as hardware limits go, I don't think you need to scale that
> number linearly with the amount of memory you have, or even with the
> IO throughput. You should reach a point where your command overhead
> is amortised sufficiently, and the controller will be pipelining the
> commands.

Amortized? The controller still would have to hunt down the 4kb page pieces that we have to feed him right now. Result: Huge scatter gather lists that may themselves create issues with higher page order.
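One way to read the 128-entry and 2 Meg figures together: if each scatter-gather entry maps exactly one page, the largest single I/O is the entry count times the page size. The sketch below works under that assumption (worst case, with no merging of physically contiguous pages); the 16K page size is taken from elsewhere in this thread rather than from this message, and `max_io_bytes` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: upper bound on the size of one I/O request when
 * every scatter-gather entry covers exactly one page of page_bytes.
 * Real requests can be larger if adjacent pages are merged into a
 * single entry; this models the unmerged worst case only. */
static uint64_t max_io_bytes(uint32_t sg_entries, uint32_t page_bytes)
{
	return (uint64_t)sg_entries * page_bytes;
}
```

Under these assumptions, 128 entries of 16 KiB give 2 MiB per request, while the same 128 entries of 4 KiB give only 512 KiB.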
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007 21:11:58 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > The whole DRAM power story is a bedtime story for gullible children. Don't > fall for it. It's not realistic. The hardware support for it DOES NOT > EXIST today, and probably won't for several years. And the real fix is > elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which > is against the whole point of FBDIMM in the first place, but that's what > you get when you ignore power in the first version!). > First of all, we have memory hot-add now, so I want to implement hot-removal of hot-added memory, at least. (In this case, we don't have to write invasive patches to memory-init-core.) Our (Fujitsu's) product, an ia64 NUMA server, has a feature to offline memory. It supports dynamic reconfiguration of nodes (node hotplug). But there is no *shipped* firmware for hotplug yet. RHEL4 couldn't boot on such hotplug-capable firmware, so the firmware team was not in a hurry. It will be shipped after RHEL5 comes out. IMHO, firmware which supports memory hot-add is ready to support memory hot-remove if the OS can handle it. Note: I have heard that embedded people often design their own memory power-off control on embedded Linux (but it never seems to be posted to the list). I don't know whether they are interested in generic memory hot-remove or not. Thanks, -Kame
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote: > So what do you mean by efficient? I guess you aren't talking about CPU > efficiency, because even if you make the IO subsystem submit larger > physical IOs, you still have to deal with 256 billion TLB entries, the > pagecache has to deal with 256 billion struct pages, so does the > filesystem code to build the bios. Re the page cache: It also needs to be able to handle large page sizes, of course. Scanning gazillions of page structs in vmscan.c will make the system slow as a dog. The number of page structs needs to be drastically reduced for large I/O. I think this can be done by allowing compound pages to be handled throughout the VM. The defrag issue then becomes very pressing indeed. We have discussed the idea of going to a kernel with a 2M base page size on x86_64, but that step is a bit drastic and the overhead for small files would be tremendous. Support for compound pages already exists in the page allocator and the slab allocator. Maybe we could extend that support to the I/O subsystem? We would also then have more contiguous writes, which would further improve I/O efficiency.
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, Mar 01, 2007 at 09:40:45PM -0800, Christoph Lameter wrote: > On Fri, 2 Mar 2007, Nick Piggin wrote: > > > So what do you mean by efficient? I guess you aren't talking about CPU > > efficiency, because even if you make the IO subsystem submit larger > > physical IOs, you still have to deal with 256 billion TLB entries, the > > pagecache has to deal with 256 billion struct pages, so does the > > filesystem code to build the bios. > > You do not have to deal with TLB entries if you do buffered I/O. Where does the data come from? > For mmapped I/O you would want to transparently use 2M TLBs if the > page size is large. > > > So you are having problems with your IO controller's handling of sg > > lists? > > We currently have problems with the kernel limits of 128 SG > entries but the fundamental issue is that we can only do 2 Meg of I/O in > one go given the default limits of the block layer. Typically the number > of hardware SG entries is also limited. We never will be able to put a Seems like changing the default limits would be the easiest way to fix it then? As far as hardware limits go, I don't think you need to scale that number linearly with the amount of memory you have, or even with the IO throughput. You should reach a point where your command overhead is amortised sufficiently, and the controller will be pipelining the commands.
belkin bulldog ups monitor vs 2.6.21-rc2
Greetings; I just rebooted to 2.6.21-rc2 and noted that getting x up and running was about 15 seconds longer than usual. When it got a bash shell going I went to it and ran htop which showed that the bulldog monitor was taking 90% of the cpu. Killed it, then restarted it, but when I ran the gui which ran fine and then stopped the gui, the daemon once again went hog wild and had to be killed, and I'm losing my kmail composer focus for 30 seconds at a time now that amanda is making her nightly run. There is nothing in the log about it other than from xinetd as it ran the amanda server stuff. Not quite ready for prime time methinks. Using the ck scheduler, this is terrible performance, virtually no multitasking. Back to 2.6.20-ck1 in the morning if it lives the rest of the night. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote: > So what do you mean by efficient? I guess you aren't talking about CPU > efficiency, because even if you make the IO subsystem submit larger > physical IOs, you still have to deal with 256 billion TLB entries, the > pagecache has to deal with 256 billion struct pages, so does the > filesystem code to build the bios. You do not have to deal with TLB entries if you do buffered I/O. For mmapped I/O you would want to transparently use 2M TLBs if the page size is large. > So you are having problems with your IO controller's handling of sg > lists? We currently have problems with the kernel limits of 128 SG entries but the fundamental issue is that we can only do 2 Meg of I/O in one go given the default limits of the block layer. Typically the number of hardware SG entries is also limited. We never will be able to put a
[PATCH] mv643xx ethernet driver
During initialization, mv643xx driver registers IRQ before setting up tx/rx rings. This causes kernel oops because mv643xx_poll, which gets called right after registering IRQ, calls netif_rx_complete, which accesses the rx ring (I don't have the oops message anymore; I just remember this sequence of calls). Attached (tested) patch first initializes the rx/tx rings and then registers the IRQ. Giri Signed-off-by: Giridhar Pemmasani <[EMAIL PROTECTED]> --- drivers/net/mv643xx_eth.c 2006-11-29 16:57:37.0 -0500 +++ ../linux-2.6.20.orig/drivers/net/mv643xx_eth.c 2007-02-23 09:38:21.0 -0500 @@ -778,14 +778,6 @@ unsigned int size; int err; - err = request_irq(dev->irq, mv643xx_eth_int_handler, - IRQF_SHARED | IRQF_SAMPLE_RANDOM, dev->name, dev); - if (err) { - printk(KERN_ERR "Can not assign IRQ number to MV643XX_eth%d\n", - port_num); - return -EAGAIN; - } - eth_port_init(mp); memset(&mp->timeout, 0, sizeof(struct timer_list)); @@ -797,8 +789,7 @@ GFP_KERNEL); if (!mp->rx_skb) { printk(KERN_ERR "%s: Cannot allocate Rx skb ring\n", dev->name); - err = -ENOMEM; - goto out_free_irq; + return -ENOMEM; } mp->tx_skb = kmalloc(sizeof(*mp->tx_skb) * mp->tx_ring_size, GFP_KERNEL); @@ -852,13 +843,8 @@ dev->name, size); printk(KERN_ERR "%s: Freeing previously allocated TX queues...", dev->name); - if (mp->rx_sram_size) - iounmap(mp->p_tx_desc_area); - else - dma_free_coherent(NULL, mp->tx_desc_area_size, - mp->p_tx_desc_area, mp->tx_desc_dma); err = -ENOMEM; - goto out_free_tx_skb; + goto out_free_tx_ring; } memset((void *)mp->p_rx_desc_area, 0, size); @@ -866,6 +852,14 @@ mv643xx_eth_rx_refill_descs(dev); /* Fill RX ring with skb's */ + err = request_irq(dev->irq, mv643xx_eth_int_handler, + IRQF_SHARED | IRQF_SAMPLE_RANDOM, dev->name, dev); + if (err) { + printk(KERN_ERR "Can not assign IRQ number to MV643XX_eth%d\n", + port_num); + goto out_free_rx_ring; + } + /* Clear any pending ethernet port interrupts */ mv_write(MV643XX_ETH_INTERRUPT_CAUSE_REG(port_num), 0); 
mv_write(MV643XX_ETH_INTERRUPT_CAUSE_EXTEND_REG(port_num), 0); @@ -891,12 +885,22 @@ return 0; +out_free_rx_ring: + if (mp->rx_sram_size) + iounmap(mp->p_rx_desc_area); + else + dma_free_coherent(NULL, mp->rx_desc_area_size, + mp->p_rx_desc_area, mp->rx_desc_dma); +out_free_tx_ring: + if (mp->tx_sram_size) + iounmap(mp->p_tx_desc_area); + else + dma_free_coherent(NULL, mp->tx_desc_area_size, + mp->p_tx_desc_area, mp->tx_desc_dma); out_free_tx_skb: kfree(mp->tx_skb); out_free_rx_skb: kfree(mp->rx_skb); -out_free_irq: - free_irq(dev->irq, dev); return err; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
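The ordering this patch establishes (set up the rings first, register the interrupt last, unwind in reverse on failure) can be sketched outside the kernel. The following is only a user-space illustration, not the driver code: struct port, fake_request_irq() and port_open() are hypothetical names, and calloc()/free() stand in for the ring allocations.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's per-port state. */
struct port {
    int *rx_ring;
    int *tx_ring;
    bool irq_registered;
};

/* Hypothetical stand-in for request_irq(); fails on demand. */
static int fake_request_irq(struct port *p, bool fail)
{
    if (fail)
        return -1;
    p->irq_registered = true;
    return 0;
}

/*
 * Allocate both rings before the interrupt is registered, so a handler
 * that fires immediately never sees an uninitialised ring. On failure,
 * release resources in the reverse order of acquisition.
 */
int port_open(struct port *p, bool irq_fails)
{
    int err;

    p->rx_ring = calloc(16, sizeof(*p->rx_ring));
    if (!p->rx_ring)
        return -1;

    p->tx_ring = calloc(16, sizeof(*p->tx_ring));
    if (!p->tx_ring) {
        err = -1;
        goto out_free_rx;
    }

    err = fake_request_irq(p, irq_fails);
    if (err)
        goto out_free_tx;

    return 0;

out_free_tx:
    free(p->tx_ring);
    p->tx_ring = NULL;
out_free_rx:
    free(p->rx_ring);
    p->rx_ring = NULL;
    return err;
}
```

Because the interrupt is the last thing acquired, the error path for a failed registration only has to free the rings, which mirrors the new out_free_rx_ring and out_free_tx_ring labels in the patch.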
Re: [ckrm-tech] [PATCH 1/2] rcfs core patch
Srivatsa Vaddagiri wrote: Heavily based on Paul Menage's (in turn cpuset) work. The big difference is that the patch uses task->nsproxy to group tasks for resource control purpose (instead of task->containers). The patch retains the same user interface as Paul Menage's patches. In particular, you can have multiple hierarchies, each hierarchy giving a different composition/view of task-groups. (Ideally this patch should have been split into 2 or 3 sub-patches, but will do that on a subsequent version post) With this, don't we end up with a lot of duplication between cpusets and rcfs? Signed-off-by : Srivatsa Vaddagiri <[EMAIL PROTECTED]> Signed-off-by : Paul Menage <[EMAIL PROTECTED]> --- linux-2.6.20-vatsa/include/linux/init_task.h |4 linux-2.6.20-vatsa/include/linux/nsproxy.h |5 linux-2.6.20-vatsa/init/Kconfig | 22 linux-2.6.20-vatsa/init/main.c |1 linux-2.6.20-vatsa/kernel/Makefile |1 --- The diffstat does not look quite right. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: patch 3 / 3: fix floppy mount bug in kernel 2.6.21-rc1
Andrew, On Thu, Mar 01, 2007 at 04:47:42PM -0800, Andrew Morton wrote: > On Thu, 01 Mar 2007 15:32:22 +0100 > "Uwe Bugla" <[EMAIL PROTECTED]> wrote: > > > Hi folks, > > this patch fixes the floppy mount bug (i. e. regression) in kernel > > 2.6.21-rc1. It was inspired by Stephane Eranian. It was tested on an Intel > > P4 1800 MHz > > (Intel ICH4 chipset) and on an AMD Athlon XP 1800 MHz (Silicon Integrated > > Systems chipset 740, 5513). > > My deep thanks and respect go to: > > Stephane Eranian, Linus Torvalds, Jiri Slaby. You are truthfully real men > > and reliable, accurate, fine chaps. It feels great to have you in this > > world-wide community! > > Would you still call the whole i386 architecture "a small number of > > machines", Mister Andrew Morton? If yes, in how far please? > > > > Signed-off-by: Uwe Bugla <[EMAIL PROTECTED]> > > > > --- a/arch/i386/kernel/process.c > > +++ b/arch/i386/kernel/process.c > > @@ -154,6 +154,7 @@ > > current_thread_info()->status |= TS_POLLING; > > } else { > > /* loop is done by the caller */ > > + local_irq_enable(); > > cpu_relax(); > > } > > } > > Linus reverted the offending patch "[PATCH] i386: add idle notifier" > on Feb 26, so this fix should no longer be needed, and 2.6.21-rc2 should > be working again. > > Hopefully Stephane will fold this fix into any future version of that patch, > if appropriate. Well, given that nobody really liked this idle notifier, I am trying to do differently on all architectures which unfortunately is not an easy thing to do. What I did not really like in all of this is that people come up with arguments without providing the data to prove it, e.g., increase interrupt latency (by how much?). -- -Stephane - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Thu, Mar 01, 2007 at 09:09:42PM -0800, Andrew Morton wrote: > > or document that drivers need to handle it specially and give them a > > way to find out about them. (Or do the horrible slab refcounting hack > > I wrote up above) > > OK. So you're proposing that XFS and ext3 simply stop using slab for this > memory? Yes.
Re: The performance and behaviour of the anti-fragmentation related patches
Linus Torvalds wrote: > Virtualization in general. We don't know what it is - in IBM machines it's > a hypervisor. With Xen and VMware, it's usually a hypervisor too. With > KVM, it's obviously a host Linux kernel/user-process combination. > > The point being that in the guests, hotunplug is almost useless (for > bigger ranges), and we're much better off just telling the virtualization > hosts on a per-page level whether we care about a page or not, than to > worry about fragmentation. > > And in hosts, we usually don't care EITHER, since it's usually done in a > hypervisor. > The paravirt_ops patches I just posted implement all the machinery required to create a pseudo-physical to machine address mapping under the kernel. This is used under Xen because it directly exposes the pagetables to its guests, but there's no reason why you couldn't use this layer to implement the same mapping without an underlying hypervisor. This allows the kernel to see a normal linear "physical" address space which is in fact mapped over a discontiguous set of machine ("real physical") pages. Andrew and I discussed using it for a kdump kernel, so that you could load it into a random bunch of pages, and set things up so that it sees itself as being contiguous. The mapping is pretty simple. It intercepts __pte (__pmd, etc) to map the "physical" page to the real machine page, and pte_val does the reverse mapping. You could implement this today as a fairly simple, thin paravirt_ops backend. The main tricky part is making sure all the device drivers are correct in using bus addresses (which are mapped to real machine addresses), and that they don't assume that adjacent kernel virtual pages are physically adjacent. J
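The pseudo-physical to machine mapping described above can be pictured as a lookup table in each direction. The sketch below is an illustration under assumptions, not the paravirt_ops code: the p2m table contents, NR_PAGES, pfn_to_mfn() and mfn_to_pfn() are all made up here, and a real implementation would not use a linear search for the reverse direction.

```c
#include <stddef.h>

#define NR_PAGES 4

/*
 * Hypothetical pseudo-physical -> machine frame table. The kernel sees
 * frames 0..3 as contiguous; the machine frames behind them are not.
 */
static const unsigned long p2m[NR_PAGES] = { 907, 12, 455, 13 };

/* Forward lookup: 0 on success, -1 if pfn is out of range. */
int pfn_to_mfn(unsigned long pfn, unsigned long *mfn)
{
    if (pfn >= NR_PAGES)
        return -1;
    *mfn = p2m[pfn];
    return 0;
}

/* Reverse lookup: 0 on success, -1 if mfn is not mapped. */
int mfn_to_pfn(unsigned long mfn, unsigned long *pfn)
{
    for (size_t i = 0; i < NR_PAGES; i++) {
        if (p2m[i] == mfn) {
            *pfn = i;
            return 0;
        }
    }
    return -1;
}
```

Pseudo-physical frames 0 and 1 are adjacent while their machine frames (907 and 12) are not, which is exactly why drivers must not assume that adjacent kernel pages are physically adjacent.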
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007, Andrew Morton wrote: > > On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> > wrote: > > > In other words, I really don't see a huge upside. I see *lots* of > > downsides, but upsides? Not so much. Almost everybody who wants unplug > > wants virtualization, and right now none of the "big virtualization" > > people would want to have kernel-level anti-fragmentation anyway since > > they'd need to do it on their own. > > Agree with all that, but you're missing the other application: power > saving. FBDIMMs take eight watts a pop. This is a hardware problem. Let's see how long it takes for Intel to realize that FBDIMM's were a hugely bad idea from a power perspective. Yes, the same issues exist for other DRAM forms too, but to a *much* smaller degree. Also, IN PRACTICE you're never ever going to see this anyway. Almost everybody wants bank interleaving, because it's a huge performance win on many loads. That, in turn, means that your memory will be spread out over multiple DIMM's even for a single page, much less any bigger area. In other words - forget about DRAM power savings. It's not realistic. And if you want low-power, don't use FBDIMM's. It really *is* that simple. (And yes, maybe FBDIMM controllers in a few years won't use 8 W per buffer. I kind of doubt that, since FBDIMM fairly fundamentally is highish voltage swings at high frequencies.) Also, on a *truly* idle system, we'll see the power savings whatever we do, because the working set will fit in D$, and to get those DRAM power savings in reality you need to have the DRAM controller shut down on its own anyway (ie sw would only help a bit). The whole DRAM power story is a bedtime story for gullible children. Don't fall for it. It's not realistic. The hardware support for it DOES NOT EXIST today, and probably won't for several years. 
And the real fix is elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which is against the whole point of FBDIMM in the first place, but that's what you get when you ignore power in the first version!). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Fri, 2 Mar 2007 05:03:51 + Christoph Hellwig <[EMAIL PROTECTED]> wrote: > On Thu, Mar 01, 2007 at 09:00:44PM -0800, Andrew Morton wrote: > > In that case we're talking about different things. > > > > I thought the proposal was to continue to use slab pages, but to take a ref > > on them as they're added to the bio, drop that ref in bi_end_io()? > > That would give you silent memory corruption in case the networking code > holds a reference after the memory gets returned to slab and reused. Well, given that bi_end_io() is called after the "io" has completed, I'm assuming that networking has completely finished with the memory by the time bi_end_io() gets called. I guess one can envisage situations where that might not happen, but they'd be terribly buggy ones, surely. > We need to either stop allowing slab memory to be passed to the block layer, > or document that drivers need to handle it specially and give them a > way to find out about them. (Or do the horrible slab refcounting hack > I wrote up above) OK. So you're proposing that XFS and ext3 simply stop using slab for this memory?
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, Mar 01, 2007 at 08:31:24PM -0800, Christoph Lameter wrote: > On Fri, 2 Mar 2007, Nick Piggin wrote: > > > > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel > > > in order to reduce overhead of managing page structs for large I/O and > > > large memory applications. We need appropriate measures to deal with the > > > fragmentation problem. > > > > I don't understand why, out of any architecture, ia64 would have to hack > > around this in software :( > > Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are > very useful for the large number of small files that are around. But for > the large streams of data you would want other methods of handling these. > > If I want to write 1 terabyte (2^50) to disk then the I/O subsystem has > to handle 2^(50-12) = 2^38 = 256 million page structs! This limits I/O > bandwidth and leads to huge scatter gather lists (and we are limited in > terms of the number of items on those lists in many drivers). Our future > platforms have up to several petabytes of memory. There needs to be some > way to handle these capacities in an efficient way. We cannot wait > an hour for the terabyte to reach the disk. I guess you mean 256 billion page structs. So what do you mean by efficient? I guess you aren't talking about CPU efficiency, because even if you make the IO subsystem submit larger physical IOs, you still have to deal with 256 billion TLB entries, the pagecache has to deal with 256 billion struct pages, so does the filesystem code to build the bios. So you are having problems with your IO controller's handling of sg lists?
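The page-count figures in this exchange can be checked with a few lines of C. This is only a worked example of the arithmetic, assuming 4 KiB pages (PAGE_SHIFT of 12); pages_needed() is a name made up for the illustration. With binary units, 2^40 bytes (1 TiB) needs 2^28 pages, about 256 million, while 2^50 bytes (1 PiB) needs 2^38 pages, about 256 billion, so the quoted "2^50" and "256 million" belong to different sizes.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4 KiB pages, as in the discussion */

/* Number of pages needed to cover 'bytes', rounding up. */
uint64_t pages_needed(uint64_t bytes)
{
    uint64_t mask = (UINT64_C(1) << PAGE_SHIFT) - 1;

    /* Shift first, then add one for a partial page, to avoid overflow. */
    return (bytes >> PAGE_SHIFT) + ((bytes & mask) != 0);
}
```

Calling pages_needed(UINT64_C(1) << 40) gives 268435456 and pages_needed(UINT64_C(1) << 50) gives 274877906944, which is the figure the reply refers to as 256 billion.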
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Thu, Mar 01, 2007 at 09:00:44PM -0800, Andrew Morton wrote: > In that case we're talking about different things. > > I thought the proposal was to continue to use slab pages, but to take a ref > on them as they're added to the bio, drop that ref in bi_end_io()? That would give you silent memory corruption in case the networking code holds a reference after the memory gets returned to slab and reused. We need to either stop allowing slab memory to be passed to the block layer, or document that drivers need to handle it specially and give them a way to find out about them. (Or do the horrible slab refcounting hack I wrote up above)
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Fri, 2 Mar 2007 04:49:10 + Christoph Hellwig <[EMAIL PROTECTED]> wrote: > On Thu, Mar 01, 2007 at 08:48:06PM -0800, Andrew Morton wrote: > > On Fri, 2 Mar 2007 04:30:39 + Christoph Hellwig <[EMAIL PROTECTED]> > > wrote: > > > > > But in this case we'd really need to enforce this, and add a > > > BUG_ON(PageSlab(page)) in bio_add_page to trip everyone submit > > > this kind of pages. > > > > That would be > > > > BUG_ON(PageSlab(page) && page_count(page) == 0)? > > No, all slab pages. Currently they all have a reference count of > zero, but we generally don't want people to pass in pages that > come from a non-refcounted allocator. In that case we're talking about different things. I thought the proposal was to continue to use slab pages, but to take a ref on them as they're added to the bio, drop that ref in bi_end_io()?
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007 20:33:04 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Thu, 1 Mar 2007, Andrew Morton wrote: > > > Sorry, but this is crap. zones and nodes are distinct, physical concepts > > and you're kidding yourself if you think you can somehow fudge things to > > make > > one of them just go away. > > > > Think: ZONE_DMA32 on an Opteron machine. I don't think there is a sane way > > in which we can fudge away the distinction between > > bus-addresses-which-have-the-32-upper-bits-zero and > > memory-which-is-local-to-each-socket. > > Of course you can. Add a virtual DMA and DMA32 zone/node and extract the > relevant memory from the base zone/node. You're using terms which I've never seen described anywhere. Please, just stop here. Give us a complete design proposal which we can understand and review. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] md: Fix for raid6 reshape.
### Comments for Changeset Recent patch for raid6 reshape had a change missing that showed up in subsequent review. Many places in the raid5 code used "conf->raid_disks-1" to mean "number of data disks". With raid6 that had to be changed to "conf->raid_disks - conf->max_degraded" or similar. One place was missed. This bug means that if a raid6 reshape were aborted in the middle the recorded position would be wrong. On restart it would either fail (as the position wasn't on an appropriate boundary) or would leave a section of the array unreshaped, causing data corruption. Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./drivers/md/raid5.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c --- .prev/drivers/md/raid5.c 2007-03-02 15:47:51.0 +1100 +++ ./drivers/md/raid5.c 2007-03-02 15:48:35.0 +1100 @@ -3071,7 +3071,7 @@ static sector_t reshape_request(mddev_t release_stripe(sh); } spin_lock_irq(&conf->device_lock); - conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1); + conf->expand_progress = (sector_nr + i) * new_data_disks; spin_unlock_irq(&conf->device_lock); /* Ok, those stripe are ready. We can start scheduling * reads on the source stripes.
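To see why the old expression records the wrong position for raid6, here is a small user-space sketch of the arithmetic. It assumes that the number of data disks is raid_disks minus max_degraded, as the changelog describes; struct array_conf, data_disks() and expand_progress() are hypothetical names for the illustration and are not the md code.

```c
#include <stdint.h>

/* Hypothetical subset of the array configuration. */
struct array_conf {
    int raid_disks;   /* total member disks */
    int max_degraded; /* 1 for raid5, 2 for raid6 */
};

/* Disks that hold data in each stripe (total minus parity disks). */
int data_disks(const struct array_conf *c)
{
    return c->raid_disks - c->max_degraded;
}

/* Recorded reshape position, in array sectors. */
uint64_t expand_progress(uint64_t sector_nr, uint64_t i, int ndata)
{
    return (sector_nr + i) * (uint64_t)ndata;
}
```

For a six-disk raid6 array the data-disk count is 4, so with sector_nr = 1024 and i = 128 the position is 1152 * 4 = 4608. The old "raid_disks - 1" factor would give 1152 * 5 = 5760, which is the kind of mismatch the changelog says leads to a failed restart or an unreshaped section.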
[PATCH 4/4] coredump: documentation for proc entry
This patch adds the documentation for /proc//coredump_omit_anonymous_shared. Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]> --- Documentation/filesystems/proc.txt | 38 +++ 1 files changed, 38 insertions(+) Index: linux-2.6.20-mm2/Documentation/filesystems/proc.txt === --- linux-2.6.20-mm2.orig/Documentation/filesystems/proc.txt +++ linux-2.6.20-mm2/Documentation/filesystems/proc.txt @@ -41,6 +41,7 @@ Table of Contents 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem 2.12 /proc//oom_adj - Adjust the oom-killer score 2.13 /proc//oom_score - Display current oom-killer score + 2.14 /proc//coredump_omit_anonymous_shared - Core dump coordinator -- Preface @@ -1982,6 +1983,43 @@ This file can be used to check the curre any given . Use it together with /proc//oom_adj to tune which process should be killed in an out-of-memory situation. +2.14 /proc//coredump_omit_anonymous_shared - Core dump coordinator +- +When a process is dumped, all anonymous memory is written to a core file as +long as the size of the core file isn't limited. But sometimes we don't want +to dump some memory segments, for example, huge shared memory. + +The /proc//coredump_omit_anonymous_shared is a flag which enables you to +omit anonymous shared memory segments from a core file when it is generated. +When the process is dumped, the core dump routine decides whether a +given memory segment should be dumped into a core file or not based on the +type of the memory segment and the flag. + +If you have written a non-zero value to this proc file, anonymous shared +memory segments are not dumped. 
There are three types of anonymous shared +memory: + + - IPC shared memory + - the memory segments created by mmap(2) with MAP_ANONYMOUS and MAP_SHARED +flags + - the memory segments created by mmap(2) with MAP_SHARED flag, and the +mapped file has already been unlinked + +Because the current core dump routine doesn't distinguish these segments, you can +only choose either dumping all anonymous shared memory segments or not. + +If you don't want to dump all shared memory segments attached to pid 1234, +write 1 to the process's proc file. + + $ echo 1 > /proc/1234/coredump_omit_anonymous_shared + +When a new process is created, the process inherits the flag status from its +parent. It is useful to set the flag before the program runs. +For example: + + $ echo 1 > /proc/self/coredump_omit_anonymous_shared + $ ./some_program + -- Summary --
[PATCH 3/4] coredump: ELF-FDPIC: enable to omit anonymous shared memory
This patch makes it possible to omit anonymous shared memory from an ELF-FDPIC formatted core file when it is generated. The debug messages from maydump() in fs/binfmt_elf_fdpic.c are changed accordingly, so that we can tell which kinds of memory segments are dumped and which are not. Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]> --- fs/binfmt_elf_fdpic.c | 25 - 1 files changed, 16 insertions(+), 9 deletions(-) Index: linux-2.6.20-mm2/fs/binfmt_elf_fdpic.c === --- linux-2.6.20-mm2.orig/fs/binfmt_elf_fdpic.c +++ linux-2.6.20-mm2/fs/binfmt_elf_fdpic.c @@ -1168,7 +1168,7 @@ static int dump_seek(struct file *file, * * I think we should skip something. But I am not sure how. H.J. */ -static int maydump(struct vm_area_struct *vma) +static int maydump(struct vm_area_struct *vma, struct mm_struct *mm) { /* Do not dump I/O mapped devices or special mappings */ if (vma->vm_flags & (VM_IO | VM_RESERVED)) { @@ -1184,15 +1184,22 @@ static int maydump(struct vm_area_struct return 0; } - /* Dump shared memory only if mapped from an anonymous file. */ + /* +* Dump shared memory only if mapped from an anonymous file and +* /proc//coredump_omit_anonymous_shared flag is not set. 
+*/ if (vma->vm_flags & VM_SHARED) { - if (vma->vm_file->f_path.dentry->d_inode->i_nlink == 0) { + if (vma->vm_file->f_path.dentry->d_inode->i_nlink) { kdcore("%08lx: %08lx: no (share)", vma->vm_start, vma->vm_flags); + return 0; + } + if (mm->coredump_omit_anon_shared) { + kdcore("%08lx: %08lx: no (anon-share)", vma->vm_start, vma->vm_flags); + return 0; + } else { + kdcore("%08lx: %08lx: yes (anon-share)", vma->vm_start, vma->vm_flags); return 1; } - - kdcore("%08lx: %08lx: no (share)", vma->vm_start, vma->vm_flags); - return 0; } #ifdef CONFIG_MMU @@ -1451,7 +1458,7 @@ static int elf_fdpic_dump_segments(struc for (vma = current->mm->mmap; vma; vma = vma->vm_next) { unsigned long addr; - if (!maydump(vma)) + if (!maydump(vma, mm)) continue; for (addr = vma->vm_start; @@ -1506,7 +1513,7 @@ static int elf_fdpic_dump_segments(struc for (vml = current->mm->context.vmlist; vml; vml = vml->next) { struct vm_area_struct *vma = vml->vma; - if (!maydump(vma)) + if (!maydump(vma, mm)) continue; if ((*size += PAGE_SIZE) > *limit) @@ -1715,7 +1722,7 @@ static int elf_fdpic_core_dump(long sign phdr.p_offset = offset; phdr.p_vaddr = vma->vm_start; phdr.p_paddr = 0; - phdr.p_filesz = maydump(vma) ? sz : 0; + phdr.p_filesz = maydump(vma, current->mm) ? sz : 0; phdr.p_memsz = sz; offset += phdr.p_filesz; phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] coredump: ELF: enable to omit anonymous shared memory
This patch makes it possible to omit anonymous shared memory from an ELF formatted core file when it is generated. Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]> --- fs/binfmt_elf.c | 12 +--- 1 files changed, 9 insertions(+), 3 deletions(-) Index: linux-2.6.20-mm2/fs/binfmt_elf.c === --- linux-2.6.20-mm2.orig/fs/binfmt_elf.c +++ linux-2.6.20-mm2/fs/binfmt_elf.c @@ -1191,9 +1191,15 @@ static int maydump(struct vm_area_struct if (vma->vm_flags & (VM_IO | VM_RESERVED)) return 0; - /* Dump shared memory only if mapped from an anonymous file. */ - if (vma->vm_flags & VM_SHARED) - return vma->vm_file->f_path.dentry->d_inode->i_nlink == 0; + /* +* Dump shared memory only if mapped from an anonymous file and +* /proc//coredump_omit_anonymous_shared flag is not set. +*/ + if (vma->vm_flags & VM_SHARED) { + if (vma->vm_file->f_path.dentry->d_inode->i_nlink) + return 0; + return vma->vm_mm->coredump_omit_anon_shared == 0; + } /* If it hasn't been written to, don't write it out */ if (!vma->anon_vma)
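The decision the new VM_SHARED branch makes can be restated as a small truth table. The sketch below is a user-space paraphrase of that branch only, under the assumption that a link count of zero is what marks a shared mapping as "anonymous" here; dump_shared_segment() and its parameters are hypothetical names and do not appear in the patch.

```c
#include <stdbool.h>

/*
 * Paraphrase of the VM_SHARED branch of maydump():
 *  - nlink != 0: the mapped file still has a name, so it is not
 *    anonymous shared memory and is never dumped by this branch;
 *  - nlink == 0: anonymous shared memory, dumped unless the per-process
 *    omit flag is set.
 */
bool dump_shared_segment(unsigned int nlink, bool omit_anon_shared)
{
    if (nlink != 0)
        return false;
    return !omit_anon_shared;
}
```

With the flag clear the behaviour matches the old one-line test (dump only when the link count is zero); setting the flag turns the remaining "yes" case into a "no".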
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Thu, Mar 01, 2007 at 08:48:06PM -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 04:30:39 + Christoph Hellwig <[EMAIL PROTECTED]> wrote:
>
> > But in this case we'd really need to enforce this, and add a
> > BUG_ON(PageSlab(page)) in bio_add_page to trip everyone submit
> > this kind of pages.
>
> That would be
>
> BUG_ON(PageSlab(page) && page_count(page) == 0)?

No, all slab pages. Currently they all have a reference count of zero,
but we generally don't want people to pass in pages that come from a
non-refcounted allocator.
[PATCH 1/4] coredump: add an interface to control the core dump routine
This patch adds an interface to set/reset a flag which determines anonymous shared memory segments should be dumped or not when a core file is generated. /proc//coredump_omit_anonymous_shared file is provided to access the flag. You can change the flag status for a particular process by writing to or reading from the file. The flag status is inherited to the child process when it is created. The flag is stored into coredump_omit_anon_shared member of mm_struct, which shares bytes with dumpable member because these two are adjacent bit fields. In order to avoid write-write race between the two, we use a global spin lock. smp_wmb() at updating dumpable is removed because set_dumpable() includes a pair of spin lock and unlock which has the effect of memory barrier. Signed-off-by: Hidehiro Kawai <[EMAIL PROTECTED]> --- fs/exec.c | 12 ++-- fs/proc/base.c | 103 ++ include/linux/binfmts.h |4 + include/linux/sched.h | 33 kernel/fork.c |3 + kernel/sys.c| 62 +++--- security/commoncap.c|2 security/dummy.c|2 8 files changed, 174 insertions(+), 47 deletions(-) Index: linux-2.6.20-mm2/fs/proc/base.c === --- linux-2.6.20-mm2.orig/fs/proc/base.c +++ linux-2.6.20-mm2/fs/proc/base.c @@ -74,6 +74,7 @@ #include #include #include +#include #include "internal.h" /* NOTE: @@ -1753,6 +1754,104 @@ static const struct inode_operations pro #endif +#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE) +static ssize_t proc_coredump_omit_anon_shared_read(struct file *file, + char __user *buf, + size_t count, + loff_t *ppos) +{ + struct task_struct *task = get_proc_task(file->f_dentry->d_inode); + struct mm_struct *mm; + char buffer[PROC_NUMBUF]; + size_t len; + loff_t __ppos = *ppos; + int ret; + + ret = -ESRCH; + if (!task) + goto out_no_task; + + ret = 0; + mm = get_task_mm(task); + if (!mm) + goto out_no_mm; + + len = snprintf(buffer, sizeof(buffer), "%u\n", + mm->coredump_omit_anon_shared); + if (__ppos >= len) + goto out; + if (count > len - __ppos) + count = len - __ppos; + + ret = 
-EFAULT; + if (copy_to_user(buf, buffer + __ppos, count)) + goto out; + + ret = count; + *ppos = __ppos + count; + + out: + mmput(mm); + out_no_mm: + put_task_struct(task); + out_no_task: + return ret; +} + +static ssize_t proc_coredump_omit_anon_shared_write(struct file *file, + const char __user *buf, + size_t count, + loff_t *ppos) +{ + struct task_struct *task; + struct mm_struct *mm; + char buffer[PROC_NUMBUF], *end; + unsigned int val; + int ret; + + ret = -EFAULT; + memset(buffer, 0, sizeof(buffer)); + if (count > sizeof(buffer) - 1) + count = sizeof(buffer) - 1; + if (copy_from_user(buffer, buf, count)) + goto out_no_task; + + ret = -EINVAL; + val = (unsigned int)simple_strtoul(buffer, &end, 0); + if (*end == '\n') + end++; + if (end - buffer == 0) + goto out_no_task; + + ret = -ESRCH; + task = get_proc_task(file->f_dentry->d_inode); + if (!task) + goto out_no_task; + + ret = end - buffer; + mm = get_task_mm(task); + if (!mm) + goto out_no_mm; + + if (down_write_trylock(&coredump_settings_sem)) { + set_coredump_omit_anon_shared(mm, (val != 0)); + up_write(&coredump_settings_sem); + } else + ret = -EBUSY; + + mmput(mm); + out_no_mm: + put_task_struct(task); + out_no_task: + return ret; +} + +static struct file_operations proc_coredump_omit_anon_shared_operations = { + .read = proc_coredump_omit_anon_shared_read, + .write = proc_coredump_omit_anon_shared_write, +}; +#endif + /* * /proc/self: */ @@ -1972,6 +2071,10 @@ static struct pid_entry tgid_base_stuff[ #ifdef CONFIG_FAULT_INJECTION REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject), #endif +#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE) + REG("coredump_omit_anonymous_shared", S_IRUGO|S_IWUSR, + coredump_omit_anon_shared), +#endif #ifdef CONFIG_TASK_IO_ACCOUNTING INF("io", S_IRUGO, pid_io_accounting), #endif Index: linux-2.6.20-mm2/include/linux/sched.h ==
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Fri, 2 Mar 2007 04:30:39 + Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> But in this case we'd really need to enforce this, and add a
> BUG_ON(PageSlab(page)) in bio_add_page to trip everyone submit
> this kind of pages.

That would be

	BUG_ON(PageSlab(page) && page_count(page) == 0)?

> > So we have a few options to look at:
> >
> > a) kludge things in AOE. Unpleasing, and might cause memory leaks
> >    (although it won't, because the caller hasn't run bi_end_io yet).
> >
> > b) Take a ref on slab pages in slab. A bit costly, perhaps.
> >
> > c) teach ext3 and XFS to take a ref on these pages as they are added to
> >    the BIOs, undo that ref in bi_end_io.
> >
> > I think c)?
>
> Yes. I'm perfectly fine with this as long as we document and enforce
> this.

And write the patch ;)
Re: Bug in on_each_cpu?
On Thursday, 1-Mar-2007 at 7:22 PST, Andrew Morton wrote:

> On Thu, 01 Mar 2007 03:47:39 -0800 Zachary Amsden <[EMAIL PROTECTED]> wrote:
>
> > Rusty Russell wrote:
> > > On Thu, 2007-03-01 at 03:34 -0800, Zachary Amsden wrote:
> > >
> > >> What would be really, really nice would be to statically check all
> > >> callsites that issue irq disables actually keep irqs disabled.
> > >> Presumably, there was a reason they disabled irqs, and re-enabling them
> > >> underneath their noses, even if it is to avoid a race, breaks the logic
> > >> behind that reason.
> > >>
> > >
> > > For the moment, how about a BUG_ON() in on_each_cpu()?
> > >
> >
> > Sounds quite decent. But why does on_each_cpu need to disable
> > interrupts at all? It just calls func(), then re-enables interrupts.
> > So whatever was going to happen during func() that might not be
> > interrupt safe could just be done in the callee, avoiding the rather
> > expensive mess of disabling and re-enabling interrupts for those cases
> > where it doesn't matter.
>
> The handler for smp_call_function() is called with local interrupts
> disabled (from the IPI handler).
>
> So to provide a consistent call environment for that handler, on_each_cpu()
> will also disable local interrupts when making the direct call on this CPU.

And further, this "consistent call environment" is *required* for
correct operation of certain callers, e.g. invalidate_bh_lrus(), whose
callback function is invalidate_bh_lru().

If invalidate_bh_lru() is called without IRQs blocked, it might be
interrupted by an IPI that causes nested execution of that same
function on behalf of another cpu's call to on_each_cpu(), and this
can lead to duplicate brelse() calls on a buf head (and ultimately
to ext3 journaling crashes due to invalid concurrent use of that
buf head).

Cheers.
-ernie
[PATCH 0/4] coredump: core dump masking support v4
Hi,

This patch series is version 4 of the core dump masking feature,
which provides a per-process flag not to dump anonymous shared
memory segments.

In the previous version, the flag value was passed around the core
dump functions as an argument so that the same setting was used
throughout a dump. In this version, a r/w semaphore instead prevents
the setting from being changed while dumping.

This patch series can be applied against 2.6.20-mm2.
The supported core file formats are ELF and ELF-FDPIC. ELF has been
tested, but ELF-FDPIC has not been built and tested because I don't
have the test environment.


Background:
Some software programs share huge memory among hundreds of
processes. If a failure occurs on one of these processes, they can
be signaled by a monitoring process to generate core files and
restart the service. However, this can develop into a system-wide
failure, such as a long system slowdown and a disk space shortage,
because the total size of the core files is huge!

To avoid the above situation we can limit the core file size by
setrlimit(2) or ulimit(1). But this method can lose important data
such as the stack because core dumping is terminated halfway.
So I suggest keeping shared memory segments from being dumped for
particular processes. Because the shared memory attached to
processes is common to them, we don't need to dump the shared memory
every time.


Usage:
Prevent all shared memory segments of pid 1234 from being dumped:

  $ echo 1 > /proc/1234/coredump_omit_anonymous_shared

When a new process is created, the process inherits the flag status
from its parent. It is useful to set the core dump flags before
the program runs. For example:

  $ echo 1 > /proc/self/coredump_omit_anonymous_shared
  $ ./some_program


ChangeLog:
v4:
  - in maydump(), retrieve the core dump setting from mm_struct
    directly, instead of its additional argument
  - writing to /proc/<pid>/coredump_omit_anonymous_shared returns
    EBUSY while core dumping.

v3:
http://groups.google.com/group/linux.kernel/browse_frm/thread/706d2ae41c1cb2de/
  - remove `/proc/<pid>/core_flags' proc entry
  - add `/proc/<pid>/coredump_anonymous_shared' as a named flag
  - remove kernel.core_flags_enable sysctl parameter

v2:
http://groups.google.com/group/linux.kernel/browse_frm/thread/cb254465971d4a42/
http://groups.google.com/group/linux.kernel/browse_frm/thread/da78f2702e06fa11/
  - rename `coremask' to `core_flags'
  - change `core_flags' member in mm_struct to a bit field
    next to `dumpable'
  - introduce a global spin lock to protect adjacent two bit fields
    (core_flags and dumpable) from race condition
  - fix a bug that the generated core file can be corrupted when
    core dumping and updating core_flags occur concurrently
  - add kernel.core_flags_enable sysctl parameter to enable/disable
    flags in /proc/<pid>/core_flags
  - support ELF-FDPIC binary format, but not tested

v1:
http://groups.google.com/group/linux.kernel/browse_frm/thread/1381fc54d716e3e6/

--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
E-mail: [EMAIL PROTECTED]
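The shell usage above can also be done from inside a program before it execs the real workload. The following is a minimal sketch under stated assumptions: `write_coredump_flag()` is a hypothetical helper name, and the proc file it would normally be pointed at exists only on a kernel carrying this patch series. According to the ChangeLog, a write may fail with EBUSY while a dump is in progress, so the caller should check the return value.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Write "1\n" (omit) or "0\n" (dump) to a flag file such as
 * /proc/self/coredump_omit_anonymous_shared.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_coredump_flag(const char *path, int omit)
{
	FILE *f = fopen(path, "w");
	int err = 0;

	if (!f)
		return -errno;
	if (fputs(omit ? "1\n" : "0\n", f) == EOF)
		err = -errno;
	/* stdio buffers the write, so an error may only surface on close. */
	if (fclose(f) == EOF && !err)
		err = -errno;
	return err;
}
```

Because the flag is inherited across fork, a launcher could call `write_coredump_flag("/proc/self/coredump_omit_anonymous_shared", 1)` once and then start its worker processes.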
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote:

> > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel
> > in order to reduce overhead of managing page structs for large I/O and
> > large memory applications. We need appropriate measures to deal with the
> > fragmentation problem.
>
> I don't understand why, out of any architecture, ia64 would have to hack
> around this in software :(

Ummm... We have x86_64 platforms with the 4k page problem. 4k pages are
very useful for the large number of small files that are around. But for
the large streams of data you would want other methods of handling these.

If I want to write 1 terabyte (2^40) to disk then the I/O subsystem has
to handle 2^(40-12) = 2^28 = 256 million page structs! This limits I/O
bandwidth and leads to huge scatter gather lists (and we are limited in
terms of the number of items on those lists in many drivers). Our future
platforms have up to several petabytes of memory. There needs to be some
way to handle these capacities in an efficient way. We cannot wait
an hour for the terabyte to reach the disk.

> > We need to reduce the real hardware zones as much as possible. Most high
> > performance architectures have no need for additional DMA zones f.e. and
> > do not have to deal with the complexities that arise there.
>
> And then you want to add something else on top of them?

zones are basically managing a number of MAX_ORDER chunks. The adding of
something here is dealing with the categorization of these MAX_ORDER
chunks in order to ensure movability and thus defragmentability of most of
them. Or the upper layer may limit the number of those chunks assigned to
a certain container.

> > Yes that would mean merging nodes and zones. So "nones".
>
> Yes, this is what Andrew just said. But you then wanted to add virtual zones
> or something on top. I just don't understand why. You agree that merging
> nodes and zones is a good idea. Did I miss the important post where some
> bright person discovered why merging zones and "virtual zones" is a bad
> idea?

Hmmm.. I usually talk about the "virtual zones" as virtual nodes. But we
are basically at the same point there. Node level controls and APIs exist
and can even be used from user space. A container could just be a special
node and then the allocations to this container could be controlled via
the existing APIs.

A virtual zone/node would be assigned a number of MAX_ORDER blocks from
real zones/nodes. Then it may hopefully be managed like a real node. In
the original zone/node these MAX_ORDER blocks would show up as
unavailable. The "upper" layer therefore is the existing node/zone layer.
The virtual zones/nodes just steal memory from the real ones.
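The page-struct count in the message above comes from dividing the transfer size by the page size. A small sketch of that arithmetic, assuming 1 terabyte is taken as 2^40 bytes and a base page as 4 KiB (2^12 bytes); `pages_needed()` is a name made up for this illustration:

```c
#include <stdint.h>

/*
 * Number of pages of 2^page_shift bytes needed to cover 2^size_shift bytes.
 * Both arguments are log2 values; a buffer no larger than one page
 * still occupies one page.
 */
static uint64_t pages_needed(unsigned int size_shift, unsigned int page_shift)
{
	if (size_shift <= page_shift)
		return 1;
	return UINT64_C(1) << (size_shift - page_shift);
}
```

Under those assumptions `pages_needed(40, 12)` gives 2^28 = 268,435,456 pages, while the same terabyte covered by 2 MiB units, `pages_needed(40, 21)`, needs only 2^19 = 524,288 of them, which is the kind of reduction that higher order pages are meant to buy.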
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007, Andrew Morton wrote:

> Sorry, but this is crap. zones and nodes are distinct, physical concepts
> and you're kidding yourself if you think you can somehow fudge things to make
> one of them just go away.
>
> Think: ZONE_DMA32 on an Opteron machine. I don't think there is a sane way
> in which we can fudge away the distinction between
> bus-addresses-which-have-the-32-upper-bits-zero and
> memory-which-is-local-to-each-socket.

Of course you can. Add a virtual DMA and DMA32 zone/node and extract the
relevant memory from the base zone/node.
[PATCH 003 of 3] knfsd: Remove CONFIG_IPV6 ifdefs from sunrpc server code.
They don't really save that much, and aren't worth the hassle. Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./include/linux/sunrpc/svc.h |2 -- ./net/sunrpc/svcsock.c | 13 +++-- 2 files changed, 3 insertions(+), 12 deletions(-) diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h --- .prev/include/linux/sunrpc/svc.h2007-03-02 14:20:13.0 +1100 +++ ./include/linux/sunrpc/svc.h2007-03-02 15:14:11.0 +1100 @@ -194,9 +194,7 @@ static inline void svc_putu32(struct kve union svc_addr_u { struct in_addr addr; -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) struct in6_addraddr6; -#endif }; /* diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c --- .prev/net/sunrpc/svcsock.c 2007-03-02 15:12:52.0 +1100 +++ ./net/sunrpc/svcsock.c 2007-03-02 15:14:11.0 +1100 @@ -131,13 +131,13 @@ static char *__svc_print_addr(struct soc NIPQUAD(((struct sockaddr_in *) addr)->sin_addr), htons(((struct sockaddr_in *) addr)->sin_port)); break; -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + case AF_INET6: snprintf(buf, len, "%x:%x:%x:%x:%x:%x:%x:%x, port=%u", NIP6(((struct sockaddr_in6 *) addr)->sin6_addr), htons(((struct sockaddr_in6 *) addr)->sin6_port)); break; -#endif + default: snprintf(buf, len, "unknown address type: %d", addr->sa_family); break; @@ -449,9 +449,7 @@ svc_wake_up(struct svc_serv *serv) union svc_pktinfo_u { struct in_pktinfo pkti; -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) struct in6_pktinfo pkti6; -#endif }; static void svc_set_cmsg_data(struct svc_rqst *rqstp, struct cmsghdr *cmh) @@ -467,7 +465,7 @@ static void svc_set_cmsg_data(struct svc cmh->cmsg_len = CMSG_LEN(sizeof(*pki)); } break; -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + case AF_INET6: { struct in6_pktinfo *pki = CMSG_DATA(cmh); @@ -479,7 +477,6 @@ static void svc_set_cmsg_data(struct svc cmh->cmsg_len = CMSG_LEN(sizeof(*pki)); } break; -#endif } return; } @@ -730,13 +727,11 @@ static inline void svc_udp_get_dest_addr 
rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr; break; } -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) case AF_INET6: { struct in6_pktinfo *pki = CMSG_DATA(cmh); ipv6_addr_copy(&rqstp->rq_daddr.addr6, &pki->ipi6_addr); break; } -#endif } } @@ -976,11 +971,9 @@ static inline int svc_port_is_privileged case AF_INET: return ntohs(((struct sockaddr_in *)sin)->sin_port) < PROT_SOCK; -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) case AF_INET6: return ntohs(((struct sockaddr_in6 *)sin)->sin6_port) < PROT_SOCK; -#endif default: return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Thu, Mar 01, 2007 at 07:22:45PM -0800, Andrew Morton wrote:
> Well I spose slab _could_ take a ref on these pages.

What it would need to do is:

 - add a reference for every object touching this page
 - don't give the page back to the page allocator or reuse any
   single object inside it until there are no more references to
   the page.

I don't think this is a very good idea, although the networking
references tend to be rather short-term, which makes this not that
bad a burden.

> Networking internally maintains caller memory lifetimes, and it assumes
> that the caller allocated memory via __alloc_pages() - because it uses
> get_page() and put_page().
>
> BIO, however, does not internally manage caller memory lifetime. This is
> because the caller's ->bi_end_io is always called, so the caller can do it.
>
> So where we've come unstuck is in a module which has gone and fed BIO
> memory into networking. The differing design philosophies are clashing.
>
> I'm surprised this doesn't happen in other places - aren't there any other
> drivers which take a BIO and stuff it down the network?
>
> Anyway, where's the bug?
>
> Really, I'd say it's XFS (and ext3). Even though BIO doesn't presently
> manage page lifetimes, it _could_. After all, the function is called
> bio_add_page(), not bio_add_virtual_address(). It's a bit hacky to kmalloc
> some memory, run virt_to_page() and to then present that page to BIO even
> though the caller (thanks to the slab optimisation) doesn't actually have
> control of that page's lifetime.

That was the conclusion I came to when this was brought up initially.
Fixing up XFS would be easyish and only waste a tiny amount of memory,
and the same is true for ext3 (I did in fact suggest just using
get_free_page for this case but got shot down for stupid reasons when
the slab debug alignment issues in that area came up).

But in this case we'd really need to enforce this, and add a
BUG_ON(PageSlab(page)) in bio_add_page to trip up everyone who submits
this kind of page.

> So we have a few options to look at:
>
> a) kludge things in AOE. Unpleasing, and might cause memory leaks
>    (although it won't, because the caller hasn't run bi_end_io yet).
>
> b) Take a ref on slab pages in slab. A bit costly, perhaps.
>
> c) teach ext3 and XFS to take a ref on these pages as they are added to
>    the BIOs, undo that ref in bi_end_io.
>
> I think c)?

Yes. I'm perfectly fine with this as long as we document and enforce
this.
[PATCH 002 of 3] knfsd: Avoid checksum checks when collecting metadata for a UDP packet.
When recv_msg is called with a size of 0 and MSG_PEEK (and
sunrpc/svcsock.c does), it is clear that we are only interested in
metadata (from/to addresses) and not the data, so don't do any
checksum checking at this point. Leave that until the data is
requested.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./net/ipv4/udp.c |    3 +++
 ./net/ipv6/udp.c |    4 ++++
 2 files changed, 7 insertions(+)

diff .prev/net/ipv4/udp.c ./net/ipv4/udp.c
--- .prev/net/ipv4/udp.c	2007-03-02 14:20:13.0 +1100
+++ ./net/ipv4/udp.c	2007-03-02 15:13:50.0 +1100
@@ -846,6 +846,9 @@ try_again:
 			goto csum_copy_err;
 		copy_only = 1;
 	}
+	if (len == 0 && (flags & MSG_PEEK))
+		/* avoid checksum concerns when just getting metadata */
+		copy_only = 1;
 
 	if (copy_only)
 		err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),

diff .prev/net/ipv6/udp.c ./net/ipv6/udp.c
--- .prev/net/ipv6/udp.c	2007-03-02 14:20:13.0 +1100
+++ ./net/ipv6/udp.c	2007-03-02 15:13:50.0 +1100
@@ -151,6 +151,10 @@ try_again:
 		copy_only = 1;
 	}
 
+	if (len == 0 && (flags & MSG_PEEK))
+		/* avoid checksum concerns when just getting metadata */
+		copy_only = 1;
+
 	if (copy_only)
 		err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
 					      msg->msg_iov, copied       );
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007 20:06:25 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> No merge them to one thing and handle them as one. No difference between
> zones and nodes anymore.

Sorry, but this is crap. zones and nodes are distinct, physical concepts
and you're kidding yourself if you think you can somehow fudge things to make
one of them just go away.

Think: ZONE_DMA32 on an Opteron machine. I don't think there is a sane way
in which we can fudge away the distinction between
bus-addresses-which-have-the-32-upper-bits-zero and
memory-which-is-local-to-each-socket.

No matter how hard those hands are waving.
[PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying
The sunrpc server code needs to know the source and destination address for UDP packets so it can reply properly. It currently copies code out of the network stack to pick the pieces out of the skb. This is ugly and causes compile problems with the IPv6 stuff. So, rip that out and use recv_msg instead. This is a much cleaner interface, but has a slight cost in that the checksum is now checked before the copy, so we don't benefit from doing both at the same time. This can probably be fixed. Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./net/sunrpc/svcsock.c | 63 - 1 file changed, 31 insertions(+), 32 deletions(-) diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c --- .prev/net/sunrpc/svcsock.c 2007-03-02 14:20:14.0 +1100 +++ ./net/sunrpc/svcsock.c 2007-03-02 15:12:52.0 +1100 @@ -721,45 +721,23 @@ svc_write_space(struct sock *sk) } } -static void svc_udp_get_sender_address(struct svc_rqst *rqstp, - struct sk_buff *skb) +static inline void svc_udp_get_dest_address(struct svc_rqst *rqstp, + struct cmsghdr *cmh) { switch (rqstp->rq_sock->sk_sk->sk_family) { case AF_INET: { - /* this seems to come from net/ipv4/udp.c:udp_recvmsg */ - struct sockaddr_in *sin = svc_addr_in(rqstp); - - sin->sin_family = AF_INET; - sin->sin_port = skb->h.uh->source; - sin->sin_addr.s_addr = skb->nh.iph->saddr; - rqstp->rq_addrlen = sizeof(struct sockaddr_in); - /* Remember which interface received this request */ - rqstp->rq_daddr.addr.s_addr = skb->nh.iph->daddr; - } + struct in_pktinfo *pki = CMSG_DATA(cmh); + rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr; break; + } #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) case AF_INET6: { - /* this is derived from net/ipv6/udp.c:udpv6_recvmesg */ - struct sockaddr_in6 *sin6 = svc_addr_in6(rqstp); - - sin6->sin6_family = AF_INET6; - sin6->sin6_port = skb->h.uh->source; - sin6->sin6_flowinfo = 0; - sin6->sin6_scope_id = 0; - if (ipv6_addr_type(&sin6->sin6_addr) & - IPV6_ADDR_LINKLOCAL) - sin6->sin6_scope_id 
= IP6CB(skb)->iif; - ipv6_addr_copy(&sin6->sin6_addr, - &skb->nh.ipv6h->saddr); - rqstp->rq_addrlen = sizeof(struct sockaddr_in); - /* Remember which interface received this request */ - ipv6_addr_copy(&rqstp->rq_daddr.addr6, - &skb->nh.ipv6h->saddr); - } + struct in6_pktinfo *pki = CMSG_DATA(cmh); + ipv6_addr_copy(&rqstp->rq_daddr.addr6, &pki->ipi6_addr); break; + } #endif } - return; } /* @@ -771,7 +749,15 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) struct svc_sock *svsk = rqstp->rq_sock; struct svc_serv *serv = svsk->sk_server; struct sk_buff *skb; + charbuffer[CMSG_SPACE(sizeof(union svc_pktinfo_u))]; + struct cmsghdr *cmh = (struct cmsghdr *)buffer; int err, len; + struct msghdr msg = { + .msg_name = svc_addr(rqstp), + .msg_control = cmh, + .msg_controllen = sizeof(buffer), + .msg_flags = MSG_DONTWAIT, + }; if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags)) /* udp sockets need large rcvbuf as all pending @@ -797,7 +783,9 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) } clear_bit(SK_DATA, &svsk->sk_flags); - while ((skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err)) == NULL) { + while ((err == kernel_recvmsg(svsk->sk_sock, &msg, NULL, + 0, 0, MSG_PEEK)) < 0 || + (skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err)) == NULL) { if (err == -EAGAIN) { svc_sock_received(svsk); return err; @@ -805,6 +793,7 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) /* possibly an icmp error */ dprintk("svc: recvfrom returned error %d\n", -err); } + rqstp->rq_addrlen = sizeof(rqstp->rq_addr); if (skb->tstamp.off_sec == 0) { struct timeval tv; @@ -827,7 +816,7 @@ svc_udp_recvfrom(struct svc_rqst *rqstp) rqstp->rq_prot = IPPROTO_UDP; - svc_udp_get_se
[PATCH 000 of 3] knfsd: Resolve IPv6 related link error
Current mainline has a link error when both
   CONFIG_IPV6=m
   CONFIG_SUNRPC=y
are set, because net/sunrpc/svcsock.c conditionally uses a function
defined in the IPv6 module.

These three patches resolve the issue.

The problem arises because svcsock needs to get the source and
destination address for a udp packet, but doesn't want to just use
sock_recvmsg like userspace would, as it wants to be able to use the
data directly out of the skbuff rather than copying it (when practical).

Currently it copies code from udp.c (both ipv4/ and ipv6/) and this
causes the problem.

This patch changes it to use kernel_recvmsg with a length of 0 and
flags of MSG_PEEK to get the addresses but leave the data untouched.

A small problem here is that kernel_recvmsg always checks the
checksum, so in the case of a large packet we will check the checksum
at a different time from when we copy it out into a buffer, which is
not ideal.

So the second patch of this series avoids the check when recv_msg is
called with size==0 and flags==MSG_PEEK. This change should be acked
by someone on netdev before going upstream!!! The rest of the series is
still appropriate without the patch; it is just a small optimisation.

Finally the last patch removes all the
  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
from sunrpc as it really isn't needed and just hides this sort of
problem.

Patches 1 and 3 are suitable for 2.6.21. Patch 2 needs confirmation.

Thanks,
NeilBrown

 [PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying
 [PATCH 002 of 3] knfsd: Avoid checksum checks when collecting metadata for a UDP packet.
 [PATCH 003 of 3] knfsd: Remove CONFIG_IPV6 ifdefs from sunrpc server code.
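The "length of 0 plus MSG_PEEK" trick described above has a direct user-space analogue with recvmsg(2), which may help when reading patch 001. This is only a sketch of the idea, not the svcsock code: `peek_sender()` is a hypothetical helper, and it uses the ordinary socket API rather than kernel_recvmsg.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Peek at the next datagram on fd with a zero-length buffer so that only
 * metadata (the sender address) is returned and the payload stays queued
 * for a later receive call.
 * On success returns 0 and stores the address length in *fromlen
 * (0 if the sender has no address); returns -1 on error with errno set.
 */
static int peek_sender(int fd, struct sockaddr_storage *from,
		       socklen_t *fromlen)
{
	struct iovec iov = { .iov_base = NULL, .iov_len = 0 };
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));
	memset(from, 0, sizeof(*from));
	msg.msg_name = from;
	msg.msg_namelen = sizeof(*from);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	/* MSG_PEEK leaves the datagram on the receive queue. */
	if (recvmsg(fd, &msg, MSG_PEEK) < 0)
		return -1;

	*fromlen = msg.msg_namelen;
	return 0;
}
```

A caller would follow a successful `peek_sender()` with a normal recv() or recvmsg() to consume the payload. Destination addresses, which the patch obtains from control messages, would additionally need a control buffer in `msg_control`; that part is left out of this sketch.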
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, Mar 02, 2007 at 04:57:51AM +0100, Nick Piggin wrote:
> On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote:
> > On Thu, 1 Mar 2007, Andrew Morton wrote:
> > > For prioritisation purposes I'd judge that memory hot-unplug is of similar
> > > value to the antifrag work (because memory hot-unplug permits DIMM
> > > poweroff).
> >
> > I would say that anti-frag / defrag enables memory unplug.
>
> Well that really depends. If you want to have any sort of guaranteed
> amount of unplugging or shrinking (or hugepage allocating), then antifrag
> doesn't work because it is a heuristic.
>
> One thing that worries me about anti-fragmentation is that people might
> actually start _using_ higher order pages in the kernel. Then fragmentation
> comes back, and it's worse because now it is not just the fringe hugepage or
> unplug users (who can anyway work around the fragmentation by allocating
> from reserve zones).
>
There are two sides to that: the ability to use higher order pages in the
kernel also means that it's possible to use larger TLB entries while
keeping the base page size small. There are already many places in the
kernel that attempt to use the largest possible size when setting up the
entries, and this is something that those of us with tiny
software-managed TLBs are a huge fan of -- some platforms have even opted
to do perverse things such as scanning for contiguous PTEs and bumping to
the next order automatically at set_pte() time.

Unplug is also interesting from a power management point of view.
Powering off is still more attractive than self-refresh, for example, but
could also be used at run-time depending on the workload.
Re: 2.6.21-rc1 and 2.6.21-rc2 kwin dies silently
Avi Kivity wrote:
Sid Boyce wrote:
> That's very much appreciated. The point is that all vanilla kernels up
> to 2.6.20+ have not had the problems now seen on 2.6.20-rc1 and
> 2.6.20-rc2 and like other problems reported, sic framebuffer, etc.,
> there is a distinct likelihood that it's related to those kernels and
> worth reporting here where it will also be seen by the openSUSE kernel
> developers.

Try running an strace on kwin and reporting the result.

Modified /opt/kde3/bin/startkde as below, but got no output, not even an
empty file.

strace -s 256 -f kwin --lock -o /home/lancelot/KWIN.out &

Perhaps that line is never executed. Try running kwin from your konsole
after it dies, with the strace of course.

Oh, and put the '-o ...' before the kwin command, not after.

Oops! The text above should read the same as the subject line: the
problems are seen on 2.6.21-rc1 and 2.6.21-rc2.

The strace is huge: 2737627 2007-03-02 03:28 KWIN.out.

Further digging shows kwin, kicker and klauncher and perhaps other
kdeinit stuff also die - no desktop icons after those 3 are started from
the commandline. Moving kdesktop_lock out of /opt/kde3/bin, everything
comes back after the video is blanked -- no password required. I shall
run like that (2.6.21-rc2-git1 currently) and wait for openSUSE to
upgrade to 2.6.21.

I can send the straces of kicker and kwin on if you think it's still
worth it.

Thanks and Regards
Sid.
--
Sid Boyce ... Hamradio License G3VBV, Licensed Private Pilot
Emeritus IBM/Amdahl Mainframes and Sun/Fujitsu Servers Tech Support
Specialist, Cricket Coach
Microsoft Windows Free Zone - Linux used for all Computing Tasks
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, Mar 01, 2007 at 08:06:25PM -0800, Christoph Lameter wrote: > On Fri, 2 Mar 2007, Nick Piggin wrote: > > > > I would say that anti-frag / defrag enables memory unplug. > > > > Well that really depends. If you want to have any sort of guaranteed > > amount of unplugging or shrinking (or hugepage allocating), then antifrag > > doesn't work because it is a heuristic. > > We would need additional measures such as real defrag and make more > structure movable. > > > One thing that worries me about anti-fragmentation is that people might > > actually start _using_ higher order pages in the kernel. Then fragmentation > > comes back, and it's worse because now it is not just the fringe hugepage or > > unplug users (who can anyway work around the fragmentation by allocating > > from reserve zones). > > Yes, we (SGI) need exactly that: Use of higher order pages in the kernel > in order to reduce overhead of managing page structs for large I/O and > large memory applications. We need appropriate measures to deal with the > fragmentation problem. I don't understand why, out of any architecture, ia64 would have to hack around this in software :( > > > Thats a value judgement that I doubt. Zone based balancing is bad and has > > > been repeatedly patched up so that it works with the usual loads. > > > > Shouldn't we fix it instead of deciding it is broken and add another layer > > on top that supposedly does better balancing? > > We need to reduce the real hardware zones as much as possible. Most high > performance architectures have no need for additional DMA zones f.e. and > do not have to deal with the complexities that arise there. And then you want to add something else on top of them? > > But just because zones are hardware _now_ doesn't mean they have to stay > > that way. The upshot is that a lot of work for zones is already there. > > Well you cannot get there without the nodes. 
The control of memory > allocations with user space support etc only comes with the nodes. > > > > A. moveable/unmovable > > > B. DMA restrictions > > > C. container assignment. > > > > There are alternatives to adding a new layer of virtual zones. We could try > > using zones, even. > > No merge them to one thing and handle them as one. No difference between > zones and nodes anymore. > > > zones aren't perfect right now, but they are quite similar to what you > > want (ie. blocks of memory). I think we should first try to generalise what > > we have rather than adding another layer. > > Yes that would mean merging nodes and zones. So "nones". Yes, this is what Andrew just said. But you then wanted to add virtual zones or something on top. I just don't understand why. You agree that merging nodes and zones is a good idea. Did I miss the important post where some bright person discovered why merging zones and "virtual zones" is a bad idea?
Re: [PATCH 4/9] Vmi fix highpte
Jeremy Fitzhardinge wrote: > Hm, I don't think this interface will work for Xen. In Xen, whenever a > pagetable page gets mapped, it must be mapped RO. map_pt_hook gets > called after the mapping has already been created, so its too late for Xen. > > I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(), > which would be wired through to paravirt_ops to allow Xen to make this a > RO mapping. Would this be sufficient for you to do your vmi thing? > Something like this (compiled, untested). J diff -r 972e84c265cf arch/i386/kernel/paravirt.c --- a/arch/i386/kernel/paravirt.c Thu Mar 01 19:12:49 2007 -0800 +++ b/arch/i386/kernel/paravirt.c Thu Mar 01 19:38:42 2007 -0800 @@ -32,6 +32,7 @@ #include #include #include +#include /* nop stub */ void _paravirt_nop(void) @@ -605,6 +606,8 @@ struct paravirt_ops paravirt_ops = { .kpte_clear_flush = native_kpte_clear_flush, + .kmap_atomic_pte = native_kmap_atomic_pte, + #ifdef CONFIG_X86_PAE .set_pte_atomic = native_set_pte_atomic, .set_pte_present = native_set_pte_present, diff -r 972e84c265cf arch/i386/mm/highmem.c --- a/arch/i386/mm/highmem.cThu Mar 01 19:12:49 2007 -0800 +++ b/arch/i386/mm/highmem.cThu Mar 01 19:38:42 2007 -0800 @@ -26,7 +26,7 @@ void kunmap(struct page *page) * However when holding an atomic kmap is is not legal to sleep, so atomic * kmaps are appropriate for short, tight code paths only. 
*/ -void *kmap_atomic(struct page *page, enum km_type type) +void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot) { enum fixed_addresses idx; unsigned long vaddr; @@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu return page_address(page); vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); + set_pte(kmap_pte-idx, mk_pte(page, prot)); return (void*) vaddr; +} + +void *kmap_atomic(struct page *page, enum km_type type) +{ + return _kmap_atomic(page, type, kmap_prot); } void kunmap_atomic(void *kvaddr, enum km_type type) diff -r 972e84c265cf arch/i386/xen/enlighten.c --- a/arch/i386/xen/enlighten.c Thu Mar 01 19:12:49 2007 -0800 +++ b/arch/i386/xen/enlighten.c Thu Mar 01 19:38:42 2007 -0800 @@ -24,6 +24,7 @@ #include #include #include +#include #include "xen-ops.h" #include "mmu.h" @@ -499,6 +500,11 @@ static void xen_release_pt(u32 pfn) ClearPagePinned(page); make_lowmem_page_readwrite(__va(PFN_PHYS(pfn))); } +} + +static void *xen_kmap_atomic_pte(struct page *page, enum km_type type) +{ + return _kmap_atomic(page, type, PAGE_KERNEL_RO); } static __init void xen_pagetable_setup_start(pgd_t *base) @@ -688,6 +694,8 @@ static const struct paravirt_ops xen_par .kpte_clear_flush = xen_kpte_clear_flush, + .kmap_atomic_pte = xen_kmap_atomic_pte, + .pte_val = xen_pte_val, .pgd_val = xen_pgd_val, diff -r 972e84c265cf include/asm-i386/highmem.h --- a/include/asm-i386/highmem.hThu Mar 01 19:12:49 2007 -0800 +++ b/include/asm-i386/highmem.hThu Mar 01 19:38:42 2007 -0800 @@ -24,6 +24,7 @@ #include #include #include +#include /* declarations for highmem.c */ extern unsigned long highstart_pfn, highend_pfn; @@ -67,10 +68,20 @@ extern void FASTCALL(kunmap_high(struct void *kmap(struct page *page); void kunmap(struct page *page); +void *_kmap_atomic(struct page *page, enum km_type type, pgprot_t prot); void *kmap_atomic(struct page *page, enum km_type type); void kunmap_atomic(void *kvaddr, enum km_type type); 
void *kmap_atomic_pfn(unsigned long pfn, enum km_type type); struct page *kmap_atomic_to_page(void *ptr); + +static inline void *native_kmap_atomic_pte(struct page *page, enum km_type type) +{ + return kmap_atomic(page, type); +} + +#ifndef CONFIG_PARAVIRT +#define kmap_atomic_pte(page, type)native_kmap_atomic_pte(page, type) +#endif #define flush_cache_kmaps()do { } while (0) diff -r 972e84c265cf include/asm-i386/paravirt.h --- a/include/asm-i386/paravirt.h Thu Mar 01 19:12:49 2007 -0800 +++ b/include/asm-i386/paravirt.h Thu Mar 01 19:38:42 2007 -0800 @@ -15,6 +15,9 @@ #ifndef __ASSEMBLY__ #include +#include + +struct page; #define paravirt_type(type)[paravirt_typenum] "i" (type) #define paravirt_clobber(clobber) [paravirt_clobber] "i" (clobber) @@ -372,6 +375,8 @@ struct paravirt_ops pte_t (*ptep_get_and_clear)(pte_t *ptep); + void *(*kmap_atomic_pte)(struct page *page, enum km_type type); + #ifdef CONFIG_X86_PAE void (*set_pte_atomic)(pte_t *ptep, pte_t pteval); void (*set_pte_present)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte); @@ -695,6 +700,13 @@ static inline void paravirt_init_pda(str #define paravirt_alloc_pd_clone(pfn, clonepfn, start, count) \ PVOP_VCALL4(alloc_pd_cl
Re: The performance and behaviour of the anti-fragmentation related patches
Linus Torvalds wrote: On Fri, 2 Mar 2007, Balbir Singh wrote: My personal opinion is that while I'm not a huge fan of virtualization, these kinds of things really _can_ be handled more cleanly at that layer, and not in the kernel at all. Afaik, it's what IBM already does, and has been doing for a while. There's no shame in looking at what already works, especially if it's simpler. Could you please clarify as to what "that layer" means - is it the firmware/hardware for virtualization? or does it refer to user space? Virtualization in general. We don't know what it is - in IBM machines it's a hypervisor. With Xen and VMware, it's usually a hypervisor too. With KVM, it's obviously a host Linux kernel/user-process combination. Thanks for clarifying. The point being that in the guests, hotunplug is almost useless (for bigger ranges), and we're much better off just telling the virtualization hosts on a per-page level whether we care about a page or not, than to worry about fragmentation. And in hosts, we usually don't care EITHER, since it's usually done in a hypervisor. It would also be useful to have a resource controller like per-container RSS control (container refers to a task grouping) within the kernel or non-virtualized environments as well. .. but this has again no impact on anti-fragmentation. Yes, I agree that anti-fragmentation and resource management are independent of each other. I must admit to being a bit selfish here, in that my main interest is in resource management and we would love to see a well written and easy to understand resource management infrastructure and controllers to control CPU and memory usage. Since the issue of per-container RSS control came up, I wanted to ensure that we do not mix up resource control and anti-fragmentation. 
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [Fastboot] [PATCH RFC 0/5] hard_smp_processor_id overhaul
On Thu, Mar 01, 2007 at 09:06:48AM -0500, Benjamin LaHaise wrote: > On Thu, Mar 01, 2007 at 04:16:13PM +0900, Fernando Luis Vázquez Cao wrote: > > As a consequence, the hardcoding of hard_smp_processor_id() to 0 on UP > > systems (see "linux/smp.h") is not correct. > > > > This patch-set does the following: > > > > 1- Remove hardcoding of hard_smp_processor_id on UP systems. > > NAK. This has to be configurable, as many embedded systems don't even > have APICs. Please rework the patch set so that there is not any overhead > for existing UP systems. Fernando did the code audit and found no instance of hard_smp_processor_id being used in the non-APIC case. So are the embedded systems you are referring to patching the kernel? Anyway, I think providing a hard_smp_processor_id() definition for UP systems without an APIC does no harm. Thanks Vivek
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Nick Piggin wrote: > > I would say that anti-frag / defrag enables memory unplug. > > Well that really depends. If you want to have any sort of guaranteed > amount of unplugging or shrinking (or hugepage allocating), then antifrag > doesn't work because it is a heuristic. We would need additional measures such as real defrag and make more structure movable. > One thing that worries me about anti-fragmentation is that people might > actually start _using_ higher order pages in the kernel. Then fragmentation > comes back, and it's worse because now it is not just the fringe hugepage or > unplug users (who can anyway work around the fragmentation by allocating > from reserve zones). Yes, we (SGI) need exactly that: Use of higher order pages in the kernel in order to reduce overhead of managing page structs for large I/O and large memory applications. We need appropriate measures to deal with the fragmentation problem. > > Thats a value judgement that I doubt. Zone based balancing is bad and has > > been repeatedly patched up so that it works with the usual loads. > > Shouldn't we fix it instead of deciding it is broken and add another layer > on top that supposedly does better balancing? We need to reduce the real hardware zones as much as possible. Most high performance architectures have no need for additional DMA zones f.e. and do not have to deal with the complexities that arise there. > But just because zones are hardware _now_ doesn't mean they have to stay > that way. The upshot is that a lot of work for zones is already there. Well you cannot get there without the nodes. The control of memory allocations with user space support etc only comes with the nodes. > > A. moveable/unmovable > > B. DMA restrictions > > C. container assignment. > > There are alternatives to adding a new layer of virtual zones. We could try > using zones, enven. No merge them to one thing and handle them as one. No difference between zones and nodes anymore. 
> zones aren't perfect right now, but they are quite similar to what you > want (ie. blocks of memory). I think we should first try to generalise what > we have rather than adding another layer. Yes that would mean merging nodes and zones. So "nones".
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > In other words, I really don't see a huge upside. I see *lots* of > downsides, but upsides? Not so much. Almost everybody who wants unplug > wants virtualization, and right now none of the "big virtualization" > people would want to have kernel-level anti-fragmentation anyway since > they'd need to do it on their own. Agree with all that, but you're missing the other application: power saving. FBDIMMs take eight watts a pop. If we can turn them off when the system is unloaded we save either four or all eight watts (assuming we can get Intel to part with the information which is needed to do this. I fear an ACPI method will ensue). There's a whole lot of complexity and work in all of this, but 24*8 watts is a lot of watts, and it's worth striving for.
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, Mar 01, 2007 at 07:05:48PM -0800, Christoph Lameter wrote: > On Thu, 1 Mar 2007, Andrew Morton wrote: > > For prioritisation purposes I'd judge that memory hot-unplug is of similar > > value to the antifrag work (because memory hot-unplug permits DIMM > > poweroff). > > I would say that anti-frag / defrag enables memory unplug. Well that really depends. If you want to have any sort of guaranteed amount of unplugging or shrinking (or hugepage allocating), then antifrag doesn't work because it is a heuristic. One thing that worries me about anti-fragmentation is that people might actually start _using_ higher order pages in the kernel. Then fragmentation comes back, and it's worse because now it is not just the fringe hugepage or unplug users (who can anyway work around the fragmentation by allocating from reserve zones). > > Our basic unit of memory management is the zone. Right now, a zone maps > > onto some hardware-imposed thing. But the zone-based MM works *well*. I > > Thats a value judgement that I doubt. Zone based balancing is bad and has > been repeatedly patched up so that it works with the usual loads. Shouldn't we fix it instead of deciding it is broken and add another layer on top that supposedly does better balancing? > > suspect that a good way to solve both per-container RSS and mem hotunplug > > is to split the zone concept away from its hardware limitations: create a > > "software zone" and a "hardware zone". All the existing page allocator and > > reclaim code remains basically unchanged, and it operates on "software > > zones". Each software zones always lies within a single hardware zone. > > The software zones are resizeable. For per-container RSS we give each > > container one (or perhaps multiple) resizeable software zones. > > Resizable software zones? Are they contiguous or not? If not then we > add another layer to the defrag problem. I think Andrew is proposing that we work out what the problem is first. 
I don't know what the defrag problem is, but I know that fragmentation is unavoidable unless you have fixed size areas for each different size of unreclaimable allocation. > > NUMA and cpusets screwed up: they've gone and used nodes as their basic > > unit of memory management whereas they should have used zones. This will > > need to be untangled. > > zones have hardware characteristics at its core. In a NUMA setting zones > determine the performance of loads from those areas. I would like to have > zones and nodes merged. Maybe extend node numbers into the negative area > -1 = DMA -2 DMA32 etc? All systems then manage the "nones" (node / zones > meerged). One could create additional "virtual" nones after the real nones > that have hardware characteristics behind them. The virtual nones would be > something like the software zones? Contain MAX_ORDER portions of hardware > nones? But just because zones are hardware _now_ doesn't mean they have to stay that way. The upshot is that a lot of work for zones is already there. > > Anyway, that's just a shot in the dark. Could be that we implement unplug > > and RSS control by totally different means. But I do wish that we'd sort > > out what those means will be before we potentially complicate the story a > > lot by adding antifragmentation. > > Hmmm My shot: > > 1. Merge zones/nodes > > 2. Create new virtual zones/nodes that are subsets of MAX_order blocks of > the real zones/nodes. These may then have additional characteristics such > as > > A. moveable/unmovable > B. DMA restrictions > C. container assignment. There are alternatives to adding a new layer of virtual zones. We could try using zones, enven. zones aren't perfect right now, but they are quite similar to what you want (ie. blocks of memory). I think we should first try to generalise what we have rather than adding another layer. 
Re: + fully-honor-vdso_enabled.patch added to -mm tree
On Thu, Mar 01, 2007 at 08:52:07PM +0300, Oleg Nesterov wrote:
> > --- a/arch/i386/kernel/sysenter.c~fully-honor-vdso_enabled
> > +++ a/arch/i386/kernel/sysenter.c
> > @@ -22,6 +22,8 @@
> >  #include
> >  #include
> >  #include
> > +#include
> > +#include
> >
> >  /*
> >   * Should the kernel map a VDSO page into processes and pass its
> > @@ -105,10 +107,25 @@ int arch_setup_additional_pages(struct l
> >  {
> >  	struct mm_struct *mm = current->mm;
> >  	unsigned long addr;
> > +	unsigned long flags;
> >  	int ret;
> >
> > +	switch (vdso_enabled) {
> > +	case 0: /* none */
> > +		return 0;
>
> This means we don't initialize mm->context.vdso and ->sysenter_return.
>
> Is it ok? For example, setup_rt_frame() uses VDSO_SYM(&__kernel_rt_sigreturn),
> sysenter_past_esp pushes ->sysenter_return on stack.
>

The setup_rt_frame() case is fairly straightforward, both PPC and SH already check to make sure there's a valid context before trying to use VDSO_SYM(), I'm not sure why x86 doesn't.

Though I wonder if there's any point in checking binfmt->hasvdso here? There shouldn't be a valid mm->context.vdso in the !hasvdso case..

Someone else will have to comment on ->sysenter_return.

Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>

--

 arch/i386/kernel/signal.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/i386/kernel/signal.c b/arch/i386/kernel/signal.c
index 4f99e87..f778d34 100644
--- a/arch/i386/kernel/signal.c
+++ b/arch/i386/kernel/signal.c
@@ -350,7 +350,7 @@ static int setup_frame(int sig, struct k_sigaction *ka,
 			goto give_sigsegv;
 	}
 
-	if (current->binfmt->hasvdso)
+	if (current->binfmt->hasvdso && current->mm->context.vdso)
 		restorer = (void *)VDSO_SYM(&__kernel_sigreturn);
 	else
 		restorer = (void *)&frame->retcode;
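The guard added by the patch above boils down to a two-condition check before the vDSO trampoline is trusted. A minimal user-space sketch of that decision follows; the toy_ structures and toy_use_vdso_restorer() are hypothetical stand-ins for current->binfmt and current->mm->context, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two pieces of state the patch consults. */
struct toy_binfmt {
	int hasvdso;     /* binary format supports a vDSO at all */
};

struct toy_mm {
	void *vdso;      /* NULL when no vDSO page was mapped (vdso_enabled == 0) */
};

/*
 * Return 1 when the signal restorer may point into the vDSO, 0 when the
 * caller should fall back to the trampoline stored in the signal frame.
 */
static int toy_use_vdso_restorer(const struct toy_binfmt *fmt,
				 const struct toy_mm *mm)
{
	assert(fmt != NULL && mm != NULL);
	return fmt->hasvdso && mm->vdso != NULL;
}
```

The second condition is the one that matters once vdso_enabled can be 0: a format that nominally has a vDSO may still run with none mapped, and computing a symbol address relative to a NULL base would hand the process a bogus restorer.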
Re: The performance and behaviour of the anti-fragmentation related patches
On Fri, 2 Mar 2007, Balbir Singh wrote: > > > My personal opinion is that while I'm not a huge fan of virtualization, > > these kinds of things really _can_ be handled more cleanly at that layer, > > and not in the kernel at all. Afaik, it's what IBM already does, and has > > been doing for a while. There's no shame in looking at what already works, > > especially if it's simpler. > > Could you please clarify as to what "that layer" means - is it the > firmware/hardware for virtualization? or does it refer to user space? Virtualization in general. We don't know what it is - in IBM machines it's a hypervisor. With Xen and VMware, it's usually a hypervisor too. With KVM, it's obviously a host Linux kernel/user-process combination. The point being that in the guests, hotunplug is almost useless (for bigger ranges), and we're much better off just telling the virtualization hosts on a per-page level whether we care about a page or not, than to worry about fragmentation. And in hosts, we usually don't care EITHER, since it's usually done in a hypervisor. > It would also be useful to have a resource controller like per-container > RSS control (container refers to a task grouping) within the kernel or > non-virtualized environments as well. .. but this has again no impact on anti-fragmentation. In other words, I really don't see a huge upside. I see *lots* of downsides, but upsides? Not so much. Almost everybody who wants unplug wants virtualization, and right now none of the "big virtualization" people would want to have kernel-level anti-fragmentation anyway since they'd need to do it on their own. Linus
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007 16:09:15 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > On Thu, 1 Mar 2007 10:12:50 + > [EMAIL PROTECTED] (Mel Gorman) wrote: > > > Any opinion on merging these patches into -mm > > for wider testing? > > I'm a little reluctant to make changes to -mm's core mm unless those > changes are reasonably certain to be on track for mainline, so let's talk > about that. > > What worries me is memory hot-unplug and per-container RSS limits. We > don't know how we're going to do either of these yet, and it could well be > that the anti-frag work significantly complexicates whatever we end up > doing there. > > For prioritisation purposes I'd judge that memory hot-unplug is of similar > value to the antifrag work (because memory hot-unplug permits DIMM > poweroff). About memory-hot-unplug, I'm now writing a new patch-set for memory-unplug for showing my overview and roadmap. I'm now debugging it. I think I will be able to post them as RFC in a week. At least, ZONE_MOVABLE(or something partitioning memory) is necessary for memory-hot-unplug like DIMM-poweroff. (I'm now using my own ZONE_MOVABLE patch, but It is O.K. to migrate to Mel's one if it's ready to be merged.) > Our basic unit of memory management is the zone. Right now, a zone maps > onto some hardware-imposed thing. But the zone-based MM works *well*. I > suspect that a good way to solve both per-container RSS and mem hotunplug > is to split the zone concept away from its hardware limitations: create a > "software zone" and a "hardware zone". All the existing page allocator and > reclaim code remains basically unchanged, and it operates on "software > zones". Each software zones always lies within a single hardware zone. > The software zones are resizeable. For per-container RSS we give each > container one (or perhaps multiple) resizeable software zones. 
> > For memory hotunplug, some of the hardware zone's software zones are marked > reclaimable and some are not; DIMMs which are wholly within reclaimable > zones can be depopulated and powered off or removed. > Hmm...software-zone seems attractive. I remember someone posted a pseudo-zone (pzone) patch in the past. -Kame
Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
On Fri, 2 Mar 2007 02:29:19 + Christoph Hellwig <[EMAIL PROTECTED]> wrote: > On Thu, Mar 01, 2007 at 05:42:04PM -0800, Andrew Morton wrote: > > Something funny is going on here. > > Not so funny for those who've tried to sort out the issue over > the past years and just got ignored.. > > > Generally, one should increment the refcount of a page when it is put into > > some container. That means that the page should get +1 when it is added to > > a bio. (direct-io does this, but the mpage.c pagecache code cheats, and > > relies upon PG_locked and PG-writeback protecting the page). > > It's a slab page, and slab pages aren't refcounted (which is a good thing > as you don't own the whole page) ah, I see. > > Similarly, the network code (or its caller) should be incrementing the > > page's refcount as the page goes into a container (ie: the skb) and > > decrementing it as the page is removed. > > > > But someone somewhere is breaking those rules. Who? > > slab code. Well I spose slab _could_ take a ref on these pages. > > So. Who is breaking refcounting protocol here? Perhaps it is AOE, failing > > to increment the refcount on pages as they are added to an skb? > > > > (Do we know which callsite in XFS is adding zero-ref pages to a BIO, btw?) > > For example all log I/O is done from kmalloce pages. > > Anyway, to rehash what I've been trying to get clarified for ages: > > > (1) should we allow to pass slab pages into bios > > and > > (2) if yes what's the way lower layers are supposed to handle them > for any possible refcounting operations like networking or rdma. > > There's also a pontial caller in ext3 that can send down kmalloc'ed > buffers: journal_write_metadata_buffer() in need_copy_out && !done_copy_out > case. But apparently that's an almost dead code path as I've never > seen anyone tripping this one, it's always XFS that people report. OK. Let's go through it. 
Networking internally maintains caller memory lifetimes, and it assumes that the caller allocated memory via __alloc_pages() - because it uses get_page() and put_page().

BIO, however, does not internally manage caller memory lifetime. This is because the caller's ->bi_end_io is always called, so the caller can do it.

So where we've come unstuck is in a module which has gone and fed BIO memory into networking. The differing design philosophies are clashing. I'm surprised this doesn't happen in other places - aren't there any other drivers which take a BIO and stuff it down the network?

Anyway, where's the bug? Really, I'd say it's XFS (and ext3). Even though BIO doesn't presently manage page lifetimes, it _could_. After all, the function is called bio_add_page(), not bio_add_virtual_address(). It's a bit hacky to kmalloc some memory, run virt_to_page() and to then present that page to BIO even though the caller (thanks to the slab optimisation) doesn't actually have control of that page's lifetime.

So we have a few options to look at:

a) kludge things in AOE. Unpleasing, and might cause memory leaks (although it won't, because the caller hasn't run bi_end_io yet).

b) Take a ref on slab pages in slab. A bit costly, perhaps.

c) teach ext3 and XFS to take a ref on these pages as they are added to the BIOs, undo that ref in bi_end_io.

I think c)?
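Option c) above can be illustrated with a small user-space model of the refcounting protocol. This is a sketch under assumptions, not kernel code: struct toy_page, struct toy_bio and every toy_ function are hypothetical, and the refcount field merely stands in for the page's _count.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BIO_MAX_PAGES 4

/* Hypothetical stand-in for struct page; refcount models _count. */
struct toy_page {
	int refcount;
};

/* Hypothetical stand-in for a bio: just a small array of page pointers. */
struct toy_bio {
	struct toy_page *pages[TOY_BIO_MAX_PAGES];
	size_t nr_pages;
};

/* Pin a page while some container points at it. */
static void toy_get_page(struct toy_page *page)
{
	page->refcount++;
}

/* Drop a pin; a real allocator would free the page when this hits zero. */
static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Option c): the submitter takes a reference as the page enters the bio. */
static int toy_bio_add_page(struct toy_bio *bio, struct toy_page *page)
{
	if (bio->nr_pages == TOY_BIO_MAX_PAGES)
		return -1;
	toy_get_page(page);
	bio->pages[bio->nr_pages++] = page;
	return 0;
}

/* Completion handler: undo every reference taken at add time. */
static void toy_bio_end_io(struct toy_bio *bio)
{
	while (bio->nr_pages > 0)
		toy_put_page(bio->pages[--bio->nr_pages]);
}
```

With the submitter's pin in place, a lower layer that does its own get/put pair (as the message says networking does) moves the count from 1 to 2 and back, and never sees it reach zero underneath the submitter. In this model the count returns to its starting value only after the completion handler has run.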
Re: [PATCH] rcutorture: Mark rcu_torture_init as __init
On Thu, Mar 01, 2007 at 11:29:03AM -0800, Josh Triplett wrote:

Acked-by: Paul E. McKenney <[EMAIL PROTECTED]>

> Signed-off-by: Josh Triplett <[EMAIL PROTECTED]>
> ---
> The corresponding rcu_torture_cleanup cannot get marked as __exit, because
> rcu_torture_init uses it to clean up if init fails.
>
>  kernel/rcutorture.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
> index 7258bcb..df49eca 100644
> --- a/kernel/rcutorture.c
> +++ b/kernel/rcutorture.c
> @@ -866,7 +866,7 @@ rcu_torture_cleanup(void)
>  	rcu_torture_print_module_parms("End of test: SUCCESS");
>  }
>
> -static int
> +static int __init
>  rcu_torture_init(void)
>  {
>  	int i;
>
Re: [PATCH RFC 0/5] hard_smp_processor_id overhaul
On Thu, 2007-03-01 at 09:06 -0500, Benjamin LaHaise wrote: > On Thu, Mar 01, 2007 at 04:16:13PM +0900, Fernando Luis Vázquez Cao wrote: > > As a consequence, the hardcoding of hard_smp_processor_id() to 0 on UP > > systems (see "linux/smp.h") is not correct. > > > > This patch-set does the following: > > > > 1- Remove hardcoding of hard_smp_processor_id on UP systems. > > NAK. This has to be configurable, as many embedded systems don't even > have APICs. Please rework the patch set so that there is not any overhead > for existing UP systems. In i386 (with the exception of voyager) and x86_64, hard_smp_processor_id is not used anywhere in the kernel when there are no APICs available. Regarding the overhead, hard_smp_processor_id is used mostly during initialization and doesn't seem to be used in any fast path in i386, x86_64, and ia64. All the other architectures are not affected by this patch, because I kept the hardcoding of hard_smp_processor_id on UP kernels, and just moved the definition to asm/smp.h because it should be handled by architecture-specific code. So unless strictly necessary I would not like to make these patches dependent on kdump.
Re: [PATCH 4/9] Vmi fix highpte
Zachary Amsden wrote: > Provide a PT map hook for HIGHPTE kernels to designate where they are mapping > page tables. This information is required so the physical address of PTE > updates can be determined; otherwise, the mm layer would have to carry the > physical address all the way to each PTE modification callsite, which is > even more hideous than the macros required to provide the proper hooks. > > So let's not mess up arch neutral code to achieve this, but keep the horror > in an #ifdef HIGHPTE in include/asm-i386/pgtable.h. I had to use macros > here because some types are not yet defined in all the include paths for > this header. > > This patch is absolutely required for HIGHPTE kernels to operate properly > with VMI. > Hm, I don't think this interface will work for Xen. In Xen, whenever a pagetable page gets mapped, it must be mapped RO. map_pt_hook gets called after the mapping has already been created, so it's too late for Xen. I was planning on adding kmap_atomic_pte() for use in pte_offset_map*(), which would be wired through to paravirt_ops to allow Xen to make this a RO mapping. Would this be sufficient for you to do your vmi thing? J
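The proposal above, one mapping helper that takes the protection as a parameter plus a per-backend hook that picks the protection before the mapping exists, can be modelled in a few lines of user-space C. Everything here is a hypothetical sketch: the toy_ types and functions only mirror the shape of kmap_atomic_pte() being routed through an ops table, and do not reproduce paravirt_ops itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical protection values; the real code would use pgprot_t. */
enum toy_prot { TOY_PROT_RW, TOY_PROT_RO };

/* Record of a mapping: which page, and with what protection. */
struct toy_mapping {
	const void *page;
	enum toy_prot prot;
};

/* Common helper: the protection is chosen by the caller, up front. */
static struct toy_mapping toy_map_with_prot(const void *page,
					    enum toy_prot prot)
{
	struct toy_mapping m = { page, prot };

	assert(page != NULL);
	return m;
}

/* Native backend: page-table pages are mapped read-write as usual. */
static struct toy_mapping toy_native_map_pte(const void *page)
{
	return toy_map_with_prot(page, TOY_PROT_RW);
}

/* Xen-like backend: page-table pages must be mapped read-only. */
static struct toy_mapping toy_xen_map_pte(const void *page)
{
	return toy_map_with_prot(page, TOY_PROT_RO);
}

/* Hypothetical ops table with a single hook for mapping a PTE page. */
struct toy_pv_ops {
	struct toy_mapping (*map_pte)(const void *page);
};

static const struct toy_pv_ops toy_native_ops = {
	.map_pte = toy_native_map_pte,
};

static const struct toy_pv_ops toy_xen_ops = {
	.map_pte = toy_xen_map_pte,
};
```

The design point is ordering: because the backend supplies the protection when the mapping is created, there is never a window in which a page-table page is writable. A hook that runs after the mapping has been made, which is the objection raised against map_pt_hook, cannot give that guarantee.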
Re: The performance and behaviour of the anti-fragmentation related patches
On Thu, 1 Mar 2007, Andrew Morton wrote:
> What worries me is memory hot-unplug and per-container RSS limits. We
> don't know how we're going to do either of these yet, and it could well be
> that the anti-frag work significantly complexicates whatever we end up
> doing there.

Right now it seems that the per-container RSS limits differ from the statistics calculated per zone. There would be a conceptual overlap, but the containers are optional and track numbers differently. There is no RSS counter in a zone, for example.

Memory hot-unplug would directly tap into the anti-frag work. Essentially, only the zone with movable pages would be unpluggable without additional measures. Making slab items and other fixed allocations movable requires work anyway. A new zone concept will not help.

> For prioritisation purposes I'd judge that memory hot-unplug is of similar
> value to the antifrag work (because memory hot-unplug permits DIMM
> poweroff).

I would say that anti-frag / defrag enables memory unplug.

> And I'd judge that per-container RSS limits are of considerably more value
> than antifrag (in fact per-container RSS might be a superset of antifrag,
> in the sense that per-container RSS and containers could be abused to fix
> the i-cant-get-any-hugepages problem, dunno).

They relate? How can a container perform antifrag? Do you mean that a container reserves a portion of a hardware zone and becomes a software zone?

> So some urgent questions are: how are we going to do mem hotunplug and
> per-container RSS?

Separately. There is no need to mingle these two together.

> Our basic unit of memory management is the zone. Right now, a zone maps
> onto some hardware-imposed thing. But the zone-based MM works *well*. I

That's a value judgement that I doubt. Zone-based balancing is bad and has been repeatedly patched up so that it works with the usual loads.

> suspect that a good way to solve both per-container RSS and mem hotunplug
> is to split the zone concept away from its hardware limitations: create a
> "software zone" and a "hardware zone". All the existing page allocator and
> reclaim code remains basically unchanged, and it operates on "software
> zones". Each software zone always lies within a single hardware zone.
> The software zones are resizeable. For per-container RSS we give each
> container one (or perhaps multiple) resizeable software zones.

Resizable software zones? Are they contiguous or not? If not, then we add another layer to the defrag problem.

> For memory hotunplug, some of the hardware zone's software zones are marked
> reclaimable and some are not; DIMMs which are wholly within reclaimable
> zones can be depopulated and powered off or removed.

So subzones indeed. How about calling the MAX_ORDER entities that Mel's patches create "software zones"?

> NUMA and cpusets screwed up: they've gone and used nodes as their basic
> unit of memory management whereas they should have used zones. This will
> need to be untangled.

Zones have hardware characteristics at their core. In a NUMA setting, zones determine the performance of loads from those areas. I would like to have zones and nodes merged. Maybe extend node numbers into the negative area (-1 = DMA, -2 = DMA32, etc.)? All systems then manage the "nones" (nodes / zones merged). One could create additional "virtual" nones after the real nones that have hardware characteristics behind them. The virtual nones would be something like the software zones? They would contain MAX_ORDER portions of hardware nones?

> Anyway, that's just a shot in the dark. Could be that we implement unplug
> and RSS control by totally different means. But I do wish that we'd sort
> out what those means will be before we potentially complicate the story a
> lot by adding antifragmentation.

Hmmm. My shot:

1. Merge zones/nodes

2. Create new virtual zones/nodes that are subsets of MAX_ORDER blocks of the real zones/nodes. These may then have additional characteristics such as

   A. moveable/unmovable
   B. DMA restrictions
   C. container assignment.
Re: 2.6.20-rc1: CIFS cheers, NFS4 jeers
On Wed, Feb 28, 2007 at 09:52:34PM -0800, Andrew Morton wrote: > On Mon, 26 Feb 2007 00:45:00 -0600 [EMAIL PROTECTED] (Florin Iucha) wrote: > > > Hello, it's me and my 70 GB of photos again. [snip] > > Running 'top', one core is idle and the other is 99% waiting, while > > the 'cp' program is in 'D' state. Also, after NFSv4 stalls, invokations > > of 'lsof' stall as well. I can 'ssh' into the box without problems. > > and > > > The kernel on the client is 2.6.21-rc1 (but it echoes problems I > > reported in December with 2.6.20 series as well) as can be seen from > > the kernel logs. > > > > I have corrected the links: > > > >http://iucha.net/21-rc1/before.1 > >http://iucha.net/21-rc1/after.1 > >http://iucha.net/21-rc1/config-2.6.21-rc1 > > > > The relevant part is: > > [ 1215.657827] cpD 00f86f105704 0 2859 2843 > (NOTLB) > [ 1215.657833] 81007343faa8 0082 > 81007343fb58 > [ 1215.657837] 0002 81007343faa8 0008 > 81007e578ee0 > [ 1215.657842] 810002f4a080 2150 81007e5790b8 > 00017343fb50 > [ 1215.657847] Call Trace: > [ 1215.657852] [] io_schedule+0x28/0x34 > [ 1215.657856] [] sync_page+0x41/0x45 > [ 1215.657859] [] __wait_on_bit+0x45/0x77 > [ 1215.657862] [] sync_page+0x0/0x45 > [ 1215.657867] [] wait_on_page_bit+0x6e/0x75 > [ 1215.657870] [] wake_bit_function+0x0/0x2a > [ 1215.657874] [] pagevec_lookup_tag+0x22/0x2b > [ 1215.657878] [] wait_on_page_writeback_range+0x6e/0x142 > [ 1215.657885] [] filemap_fdatawait+0x20/0x22 > [ 1215.657889] [] filemap_write_and_wait+0x29/0x38 > [ 1215.657894] [] nfs_setattr+0xa0/0x11a > [ 1215.657897] [] link_path_walk+0xe8/0xfc > [ 1215.657902] [] autoremove_wake_function+0x0/0x38 > [ 1215.657907] [] poison_obj+0x27/0x32 > [ 1215.657910] [] current_fs_time+0x3f/0x41 > [ 1215.657913] [] __user_walk_fd+0x53/0x62 > [ 1215.657918] [] notify_change+0x129/0x238 > [ 1215.657923] [] do_utimes+0xfc/0x126 > [ 1215.657928] [] _raw_spin_lock+0xf3/0xf9 > [ 1215.657933] [] sys_futimesat+0x45/0x56 > [ 1215.657937] [] sys_utimes+0x14/0x16 > [ 
1215.657941] [] system_call+0x7e/0x83
>
> seems that we've simply lost an IO completion.
>
> Was 2.6.19 OK?

I just tested, 2.6.19 is OK!

The kernel log output from after the cp and sync completed is at

   http://iucha.net/19/before
   http://iucha.net/19/after (after echo t > /proc/sysrq-trigger)

When I get a chance I will try again, and report if it fails. But so far it seems fine: df and lsof work as expected.

Thanks,
florin

--
Bruce Schneier expects the Spanish Inquisition.
   http://geekz.co.uk/schneierfacts/fact/163
[PATCH 7/9] Fix nohz compile.patch
More goo from hrtimers integration. We do compile and run properly with NO_HZ enabled. There was a period when we didn't because of a missing export, but that was since fixed. And with the clocksource code now firmly in place, we can get rid of code that fixes up the wallclock, since this is done in the common infrastructure. This actually fixes a timer bug as well, that was caused by do_settimeofday no longer being callable with interrupts disabled due to the use of on_each_cpu(). Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r 5d41588419ab arch/i386/Kconfig --- a/arch/i386/Kconfig Tue Feb 27 17:24:55 2007 -0800 +++ b/arch/i386/Kconfig Tue Feb 27 17:25:44 2007 -0800 @@ -220,7 +220,7 @@ config PARAVIRT config VMI bool "VMI Paravirt-ops support" - depends on PARAVIRT && !NO_HZ + depends on PARAVIRT default y help VMI provides a paravirtualized interface to multiple hypervisors diff -r 5d41588419ab arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.cTue Feb 27 17:24:55 2007 -0800 +++ b/arch/i386/kernel/vmi.cTue Feb 27 18:46:26 2007 -0800 @@ -934,6 +934,7 @@ void __init vmi_init(void) #ifdef CONFIG_X86_IO_APIC no_timer_check = 1; #endif + no_sync_cmos_clock = 1; local_irq_restore(flags & X86_EFLAGS_IF); } diff -r 5d41588419ab arch/i386/kernel/vmitime.c --- a/arch/i386/kernel/vmitime.cTue Feb 27 17:24:55 2007 -0800 +++ b/arch/i386/kernel/vmitime.cTue Feb 27 18:47:51 2007 -0800 @@ -153,13 +153,6 @@ static void vmi_get_wallclock_ts(struct ts->tv_sec = wallclock; } -static void update_xtime_from_wallclock(void) -{ - struct timespec ts; - vmi_get_wallclock_ts(&ts); - do_settimeofday(&ts); -} - unsigned long vmi_get_wallclock(void) { struct timespec ts; @@ -197,18 +190,10 @@ void __init vmi_time_init(void) set_intr_gate(LOCAL_TIMER_VECTOR, apic_vmi_timer_interrupt); #endif - no_sync_cmos_clock = 1; - - vmi_get_wallclock_ts(&xtime); - set_normalized_timespec(&wall_to_monotonic, - -xtime.tv_sec, -xtime.tv_nsec); - real_cycles_accounted_system = read_real_cycles(); - 
update_xtime_from_wallclock(); per_cpu(process_times_cycles_accounted_cpu, 0) = read_available_cycles(); cycles_per_sec = vmi_timer_ops.get_cycle_frequency(); - cycles_per_jiffy = cycles_per_sec; (void)do_div(cycles_per_jiffy, HZ); cycles_per_alarm = cycles_per_sec; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 6/9] Pit override.patch
The time_init_hook in paravirt-ops no longer functions in the correct manner after the integration of the hrtimers code. The problem is that now the call path for time initialization is: time_init : late_time_init = hpet_time_init; late_time_init -> hpet_time_init: setup_pit_timer (BAD) do_time_init --> (via paravirt.h) time_init_hook --> (via arch_hooks.h) time_init_hook (in SUBARCH/setup.c) If this isn't confusing enough, the paravirt case goes through an indirect function pointer in the paravirt-ops table. The problem is, by the time the paravirt hook is called, the pit timer is already enabled. But paravirt guests have their own timer, and don't want to use the PIT. Rather than intensify the struggle for power going on here, just make it all nice and simple and just unconditionally do all timer setup in the late_time_init hook. This also has the advantage of enabling timers in the same place in all code paths, so everyone has the same bugs and we don't have outliers who break other code because they turn on timer too early or too late. So the paravirt-ops time init function is now by default hpet_time_init, which is the time init function used for native hardware. Paravirt guests have the chance to override this when they setup the paravirt-ops table, and should need no change. 
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r 2ae8eb19b227 arch/i386/kernel/paravirt.c --- a/arch/i386/kernel/paravirt.c Tue Feb 27 16:28:10 2007 -0800 +++ b/arch/i386/kernel/paravirt.c Tue Feb 27 17:08:11 2007 -0800 @@ -494,7 +494,7 @@ struct paravirt_ops paravirt_ops = { .memory_setup = machine_specific_memory_setup, .get_wallclock = native_get_wallclock, .set_wallclock = native_set_wallclock, - .time_init = time_init_hook, + .time_init = hpet_time_init, .init_IRQ = native_init_IRQ, .cpuid = native_cpuid, diff -r 2ae8eb19b227 arch/i386/kernel/time.c --- a/arch/i386/kernel/time.c Tue Feb 27 16:28:10 2007 -0800 +++ b/arch/i386/kernel/time.c Tue Feb 27 16:50:01 2007 -0800 @@ -262,14 +262,22 @@ void notify_arch_cmos_timer(void) extern void (*late_time_init)(void); /* Duplicate of time_init() below, with hpet_enable part added */ -static void __init hpet_time_init(void) +void __init hpet_time_init(void) { if (!hpet_enable()) setup_pit_timer(); - do_time_init(); -} - + time_init_hook(); +} + +/* + * This is called directly from init code; we must delay timer setup in the + * HPET case as we can't make the decision to turn on HPET this early in the + * boot process. + * + * The chosen time_init function will usually be hpet_time_init, above, but + * in the case of virtual hardware, an alternative function may be substituted. + */ void __init time_init(void) { - late_time_init = hpet_time_init; -} + late_time_init = choose_time_init(); +} diff -r 2ae8eb19b227 include/asm-i386/paravirt.h --- a/include/asm-i386/paravirt.h Tue Feb 27 16:28:10 2007 -0800 +++ b/include/asm-i386/paravirt.h Tue Feb 27 17:07:23 2007 -0800 @@ -186,9 +186,9 @@ static inline int set_wallclock(unsigned return paravirt_ops.set_wallclock(nowtime); } -static inline void do_time_init(void) -{ - return paravirt_ops.time_init(); +static inline void (*choose_time_init(void))(void) +{ + return paravirt_ops.time_init; } /* The paravirtualized CPUID instruction. 
*/ diff -r 2ae8eb19b227 include/asm-i386/time.h --- a/include/asm-i386/time.h Tue Feb 27 16:28:10 2007 -0800 +++ b/include/asm-i386/time.h Tue Feb 27 16:50:45 2007 -0800 @@ -28,13 +28,16 @@ static inline int native_set_wallclock(u return retval; } +extern void (*late_time_init)(void); +extern void hpet_time_init(void); + #ifdef CONFIG_PARAVIRT #include #else /* !CONFIG_PARAVIRT */ #define get_wallclock() native_get_wallclock() #define set_wallclock(x) native_set_wallclock(x) -#define do_time_init() time_init_hook() +#define choose_time_init() hpet_time_init #endif /* CONFIG_PARAVIRT */ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
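The deferred-selection idea in the patch above (time_init() only records which function to run, and the late hook runs it) can be sketched in plain userspace C. This is a minimal illustration, not the kernel code: the ops table, the two init functions, and simulate_boot() are simplified stand-ins invented for the example, and only the names time_init, late_time_init and choose_time_init mirror the patch.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*time_init_fn)(void);

static int pit_started;          /* set by the stand-in native path */
static int guest_timer_started;  /* set by the hypothetical guest path */

/* Stand-in for the native default (hpet_time_init in the patch). */
static void native_time_init(void) { pit_started = 1; }

/* Hypothetical guest override that never touches the PIT. */
static void guest_time_init(void)  { guest_timer_started = 1; }

/* Simplified ops table with a single entry. */
struct ops_table { time_init_fn time_init; };
static struct ops_table ops = { native_time_init };

static time_init_fn late_time_init;

/* Same shape as in the patch: return the hook, do not call it. */
static time_init_fn choose_time_init(void) { return ops.time_init; }

/* Early boot: only record which function to run later. */
static void time_init(void) { late_time_init = choose_time_init(); }

/*
 * Walk through the two boot stages and report what ran:
 * bit 0 = native PIT path, bit 1 = guest timer path.
 */
int simulate_boot(int use_guest)
{
    pit_started = guest_timer_started = 0;
    ops.time_init = use_guest ? guest_time_init : native_time_init;

    time_init();
    /* Nothing may have been started this early. */
    assert(pit_started == 0 && guest_timer_started == 0);

    late_time_init();
    return pit_started | (guest_timer_started << 1);
}
```

With the override installed before time_init() runs, the native path is never entered, which is the property the patch description asks for.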
[PATCH 9/9] Vmi smp fixes.patch
Critical fixes for SMP. Fix a couple functions which needed to be __devinit and fix a bogus parameter to AP startup that just so happened to work because the low virtual mapping of memory was still established. Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r baf2e278a482 arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.cThu Mar 01 18:08:53 2007 -0800 +++ b/arch/i386/kernel/vmi.cThu Mar 01 18:10:18 2007 -0800 @@ -525,13 +525,14 @@ void vmi_pmd_clear(pmd_t *pmd) #endif #ifdef CONFIG_SMP -struct vmi_ap_state ap; extern void setup_pda(void); -static void __init /* XXX cpu hotplug */ +static void __devinit vmi_startup_ipi_hook(int phys_apicid, unsigned long start_eip, unsigned long start_esp) { + struct vmi_ap_state ap; + /* Default everything to zero. This is fine for most GPRs. */ memset(&ap, 0, sizeof(struct vmi_ap_state)); @@ -570,7 +571,7 @@ vmi_startup_ipi_hook(int phys_apicid, un /* Protected mode, paging, AM, WP, NE, MP. */ ap.cr0 = 0x80050023; ap.cr4 = mmu_cr4_features; - vmi_ops.set_initial_ap_state(__pa(&ap), phys_apicid); + vmi_ops.set_initial_ap_state((u32)&ap, phys_apicid); } #endif diff -r baf2e278a482 arch/i386/kernel/vmitime.c --- a/arch/i386/kernel/vmitime.cThu Mar 01 18:08:53 2007 -0800 +++ b/arch/i386/kernel/vmitime.cThu Mar 01 18:08:53 2007 -0800 @@ -243,7 +243,7 @@ void __init vmi_timer_setup_boot_alarm(v /* Initialize the time accounting variables for an AP on an SMP system. * Also, set the local alarm for the AP. */ -void __init vmi_timer_setup_secondary_alarm(void) +void __devinit vmi_timer_setup_secondary_alarm(void) { int cpu = smp_processor_id(); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 8/9] Vmi apic ops.diff
Use para_fill instead of directly setting the APIC ops to the result of the vmi_get_function call - this allows one to implement a VMI ROM without implementing APIC functions, just using the native APIC functions. While doing this, I realized that there is a lot more cleanup that should have been done. Basically, we should never assume that the ROM implements a specific set of functions, and always allow fallback to the native implementation. This is critical for future compatibility. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r 0ba8434a5c7e arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.cThu Mar 01 16:49:27 2007 -0800 +++ b/arch/i386/kernel/vmi.cThu Mar 01 16:49:33 2007 -0800 @@ -54,6 +54,7 @@ static int disable_tsc; static int disable_tsc; static int disable_mtrr; static int disable_noidle; +static int disable_vmi_timer; /* Cached VMI operations */ struct { @@ -662,12 +663,12 @@ void vmi_bringup(void) void vmi_bringup(void) { /* We must establish the lowmem mapping for MMU ops to work */ - if (vmi_rom) + if (vmi_ops.set_linear_mapping) vmi_ops.set_linear_mapping(0, __PAGE_OFFSET, max_low_pfn, 0); } /* - * Return a pointer to the VMI function or a NOP stub + * Return a pointer to a VMI function or NULL if unimplemented */ static void *vmi_get_function(int vmicall) { @@ -678,12 +679,13 @@ static void *vmi_get_function(int vmical if (rel->type == VMI_RELOCATION_CALL_REL) return (void *)rel->eip; else - return (void *)vmi_nop; + return NULL; } /* * Helper macro for making the VMI paravirt-ops fill code readable. - * For unimplemented operations, fall back to default. + * For unimplemented operations, fall back to default, unless nop + * is returned by the ROM. 
*/ #define para_fill(opname, vmicall) \ do { \ @@ -692,8 +694,28 @@ do { \ if (rel->type != VMI_RELOCATION_NONE) { \ BUG_ON(rel->type != VMI_RELOCATION_CALL_REL); \ paravirt_ops.opname = (void *)rel->eip; \ + } else if (rel->type == VMI_RELOCATION_NOP) \ + paravirt_ops.opname = (void *)vmi_nop; \ +} while (0) + +/* + * Helper macro for making the VMI paravirt-ops fill code readable. + * For cached operations which do not match the VMI ROM ABI and must + * go through a tranlation stub. Ignore NOPs, since it is not clear + * a NOP * VMI function corresponds to a NOP paravirt-op when the + * functions are not in 1-1 correspondence. + */ +#define para_wrap(opname, wrapper, cache, vmicall) \ +do { \ + reloc = call_vrom_long_func(vmi_rom, get_reloc, \ + VMI_CALL_##vmicall);\ + BUG_ON(rel->type == VMI_RELOCATION_JUMP_REL); \ + if (rel->type == VMI_RELOCATION_CALL_REL) { \ + paravirt_ops.opname = wrapper; \ + vmi_ops.cache = (void *)rel->eip; \ } \ } while (0) + /* * Activate the VMI interface and switch into paravirtualized mode @@ -731,13 +753,8 @@ static inline int __init activate_vmi(vo * rdpmc is not yet used in Linux */ - /* CPUID is special, so very special */ - reloc = call_vrom_long_func(vmi_rom, get_reloc, VMI_CALL_CPUID); - if (rel->type != VMI_RELOCATION_NONE) { - BUG_ON(rel->type != VMI_RELOCATION_CALL_REL); - vmi_ops.cpuid = (void *)rel->eip; - paravirt_ops.cpuid = vmi_cpuid; - } + /* CPUID is special, so very special it gets wrapped like a present */ + para_wrap(cpuid, vmi_cpuid, cpuid, CPUID); para_fill(clts, CLTS); para_fill(get_debugreg, GetDR); @@ -754,6 +771,7 @@ static inline int __init activate_vmi(vo para_fill(restore_fl, SetInterruptMask); para_fill(irq_disable, DisableInterrupts); para_fill(irq_enable, EnableInterrupts); + /* irq_save_disable !!! 
sheer pain */ patch_offset(&irq_save_disable_callout[IRQ_PATCH_INT_MASK], (char *)paravirt_ops.save_fl); @@ -761,26 +779,18 @@ static inline int __init activate_vmi(vo (char *)paravirt_ops.irq_disable); para_fill(wbinvd, WBINVD); + para_fill(read_tsc, RDTSC); + + /* The following we emulate with trap and emulate for now */ /* paravirt_ops.read_msr = vmi_rdmsr */ /* paravirt_ops.write_msr = vmi_wrmsr */ - para_fill(read_tsc, RDTSC); /* paravirt_ops.rdpmc = vmi_rdpmc */ - /* TR interface doesn't pass TR value
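The fallback rule stated in the description above (never assume the ROM implements a function; keep the native implementation when it does not) can be sketched as a three-way choice. This is only one reading of the intent, written as self-contained userspace C: the enum, struct, and function names below are hypothetical and are not the VMI definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical relocation kinds, loosely modelled on the description. */
enum reloc_type { RELOC_NONE, RELOC_CALL, RELOC_NOP };

typedef void (*op_fn)(void);

struct reloc {
    enum reloc_type type;
    op_fn target;        /* only meaningful for RELOC_CALL */
};

static int native_calls, rom_calls;

static void native_op(void) { native_calls++; }
static void rom_op(void)    { rom_calls++; }
static void nop_op(void)    { }

/*
 * Pick the handler for one operation:
 *  - the ROM provides a call target: use it
 *  - the ROM says the operation is a no-op: use a NOP stub
 *  - the ROM does not implement it: keep the native default
 */
op_fn fill_op(op_fn native_default, struct reloc rel)
{
    switch (rel.type) {
    case RELOC_CALL:
        assert(rel.target != NULL);
        return rel.target;
    case RELOC_NOP:
        return nop_op;
    case RELOC_NONE:
    default:
        return native_default;
    }
}

/* Returns 'R' (ROM), 'N' (native) or '-' (NOP) for the handler that ran. */
char run_op_for(enum reloc_type type)
{
    struct reloc rel = { type, rom_op };
    op_fn op;

    native_calls = rom_calls = 0;
    op = fill_op(native_op, rel);
    op();

    if (rom_calls)
        return 'R';
    if (native_calls)
        return 'N';
    return '-';
}
```

Keeping the default in the table and overwriting it only on a positive answer means a ROM that implements fewer calls still leaves every slot populated.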
[PATCH 5/9] Paravirt drop udelay op
Not respecting udelay causes problems with any virtual hardware that is passed through to real hardware. This can be noticed by any device that interacts with the real world in real time - like AP startup, which takes real time. Or keyboard LEDs, which should blink in real-time. Or floppy drives, but only when passed through to a real floppy controller on OSes which can't sufficiently buffer the floppy commands to emulate a zero latency floppy. Or IDE drives, when connecting to a physical CDROM. This was mostly a hack to get the kernel to boot faster, but it introduced a number of misvirtualization bugs, and Alan and Pavel argued pretty strongly against it. We were the only client, and now want to clean up this cruft. Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r 135d1b73c878 arch/i386/kernel/paravirt.c --- a/arch/i386/kernel/paravirt.c Tue Feb 27 16:23:56 2007 -0800 +++ b/arch/i386/kernel/paravirt.c Tue Feb 27 16:25:26 2007 -0800 @@ -538,7 +538,6 @@ struct paravirt_ops paravirt_ops = { .set_iopl_mask = native_set_iopl_mask, .io_delay = native_io_delay, - .const_udelay = __const_udelay, #ifdef CONFIG_X86_LOCAL_APIC .apic_write = native_apic_write, diff -r 135d1b73c878 arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.cTue Feb 27 16:23:56 2007 -0800 +++ b/arch/i386/kernel/smpboot.cTue Feb 27 16:27:16 2007 -0800 @@ -33,11 +33,6 @@ * Dave Jones : Report invalid combinations of Athlon CPUs. * Rusty Russell : Hacked into shape for new "hotplug" boot process. 
*/ - -/* SMP boot always wants to use real time delay to allow sufficient time for - * the APs to come online */ -#define USE_REAL_TIME_DELAY - #include #include #include diff -r 135d1b73c878 arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.cTue Feb 27 16:23:56 2007 -0800 +++ b/arch/i386/kernel/vmi.cTue Feb 27 16:28:00 2007 -0800 @@ -48,7 +48,6 @@ typedef u64 __attribute__((regparm(2))) static struct vrom_header *vmi_rom; static int license_gplok; -static int disable_nodelay; static int disable_pge; static int disable_pse; static int disable_sep; @@ -801,9 +800,6 @@ static inline int __init activate_vmi(vo para_fill(set_iopl_mask, SetIOPLMask); paravirt_ops.io_delay = (void *)vmi_nop; - if (!disable_nodelay) { - paravirt_ops.const_udelay = (void *)vmi_nop; - } para_fill(set_lazy_mode, SetLazyMode); @@ -947,9 +943,7 @@ static int __init parse_vmi(char *arg) if (!arg) return -EINVAL; - if (!strcmp(arg, "disable_nodelay")) - disable_nodelay = 1; - else if (!strcmp(arg, "disable_pge")) { + if (!strcmp(arg, "disable_pge")) { clear_bit(X86_FEATURE_PGE, boot_cpu_data.x86_capability); disable_pge = 1; } else if (!strcmp(arg, "disable_pse")) { diff -r 135d1b73c878 include/asm-i386/delay.h --- a/include/asm-i386/delay.h Tue Feb 27 16:23:56 2007 -0800 +++ b/include/asm-i386/delay.h Tue Feb 27 16:26:01 2007 -0800 @@ -16,13 +16,6 @@ extern void __const_udelay(unsigned long extern void __const_udelay(unsigned long usecs); extern void __delay(unsigned long loops); -#if defined(CONFIG_PARAVIRT) && !defined(USE_REAL_TIME_DELAY) -#define udelay(n) paravirt_ops.const_udelay((n) * 0x10c7ul) - -#define ndelay(n) paravirt_ops.const_udelay((n) * 5ul) - -#else /* !PARAVIRT || USE_REAL_TIME_DELAY */ - /* 0x10c7 is 2**32 / 100 (rounded up) */ #define udelay(n) (__builtin_constant_p(n) ? \ ((n) > 2 ? __bad_udelay() : __const_udelay((n) * 0x10c7ul)) : \ @@ -32,7 +25,6 @@ extern void __delay(unsigned long loops) #define ndelay(n) (__builtin_constant_p(n) ? \ ((n) > 2 ? 
__bad_ndelay() : __const_udelay((n) * 5ul)) : \ __ndelay(n)) -#endif void use_tsc_delay(void); diff -r 135d1b73c878 include/asm-i386/paravirt.h --- a/include/asm-i386/paravirt.h Tue Feb 27 16:23:56 2007 -0800 +++ b/include/asm-i386/paravirt.h Tue Feb 27 16:25:39 2007 -0800 @@ -117,7 +117,6 @@ struct paravirt_ops void (*set_iopl_mask)(unsigned mask); void (*io_delay)(void); - void (*const_udelay)(unsigned long loops); #ifdef CONFIG_X86_LOCAL_APIC void (*apic_write)(unsigned long reg, unsigned long v); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
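The multipliers 0x10c7 and 5 in the udelay()/ndelay() macros quoted above can be checked with a few lines of arithmetic. The sketch below assumes they are 2^32 divided by the number of microseconds (10^6) and nanoseconds (10^9) in a second, rounded up; the comment in the archived hunk appears to have lost digits, so this reading is an assumption. The helper name scale_factor() is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Ceiling of 2^32 / units_per_second, computed in 64-bit arithmetic. */
uint64_t scale_factor(uint64_t units_per_second)
{
    const uint64_t two_32 = (uint64_t)1 << 32;

    assert(units_per_second != 0);
    return (two_32 + units_per_second - 1) / units_per_second;
}
```

2^32 / 10^6 is about 4294.97, which rounds up to 4295 (0x10c7), and 2^32 / 10^9 is about 4.29, which rounds up to 5.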
[PATCH 4/9] Vmi fix highpte
Provide a PT map hook for HIGHPTE kernels to designate where they are mapping page tables. This information is required so the physical address of PTE updates can be determined; otherwise, the mm layer would have to carry the physical address all the way to each PTE modification callsite, which is even more hideous that the macros required to provide the proper hooks. So lets not mess up arch neutral code to achieve this, but keep the horror in an #ifdef HIGHPTE in include/asm-i386/pgtable.h. I had to use macros here because some types are not yet defined in all the include paths for this header. This patch is absolutely required for HIGHPTE kernels to operate properly with VMI. Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r 87bf6b2d338d arch/i386/kernel/paravirt.c --- a/arch/i386/kernel/paravirt.c Tue Feb 27 14:14:34 2007 -0800 +++ b/arch/i386/kernel/paravirt.c Tue Feb 27 14:14:36 2007 -0800 @@ -553,6 +553,8 @@ struct paravirt_ops paravirt_ops = { .flush_tlb_kernel = native_flush_tlb_global, .flush_tlb_single = native_flush_tlb_single, + .map_pt_hook = (void *)native_nop, + .alloc_pt = (void *)native_nop, .alloc_pd = (void *)native_nop, .alloc_pd_clone = (void *)native_nop, diff -r 87bf6b2d338d arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.cTue Feb 27 14:14:34 2007 -0800 +++ b/arch/i386/kernel/vmi.cTue Feb 27 16:23:37 2007 -0800 @@ -370,6 +370,24 @@ static void vmi_check_page_type(u32 pfn, #define vmi_check_page_type(p,t) do { } while (0) #endif +static void vmi_map_pt_hook(int type, pte_t *va, u32 pfn) +{ + /* +* Internally, the VMI ROM must map virtual addresses to physical +* addresses for processing MMU updates. By the time MMU updates +* are issued, this information is typically already lost. +* Fortunately, the VMI provides a cache of mapping slots for active +* page tables. +* +* We use slot zero for the linear mapping of physical memory, and +* in HIGHPTE kernels, slot 1 and 2 for KM_PTE0 and KM_PTE1. 
+* +* args: SLOT VACOUNT PFN +*/ + BUG_ON(type != KM_PTE0 && type != KM_PTE1); + vmi_ops.set_linear_mapping((type - KM_PTE0)+1, (u32)va, 1, pfn); +} + static void vmi_allocate_pt(u32 pfn) { vmi_set_page_type(pfn, VMI_PAGE_L1); @@ -813,6 +831,7 @@ static inline int __init activate_vmi(vo vmi_ops.allocate_page = vmi_get_function(VMI_CALL_AllocatePage); vmi_ops.release_page = vmi_get_function(VMI_CALL_ReleasePage); + paravirt_ops.map_pt_hook = vmi_map_pt_hook; paravirt_ops.alloc_pt = vmi_allocate_pt; paravirt_ops.alloc_pd = vmi_allocate_pd; paravirt_ops.alloc_pd_clone = vmi_allocate_pd_clone; diff -r 87bf6b2d338d include/asm-i386/paravirt.h --- a/include/asm-i386/paravirt.h Tue Feb 27 14:14:34 2007 -0800 +++ b/include/asm-i386/paravirt.h Tue Feb 27 16:21:22 2007 -0800 @@ -131,6 +131,8 @@ struct paravirt_ops void (*flush_tlb_kernel)(void); void (*flush_tlb_single)(u32 addr); + void (fastcall *map_pt_hook)(int type, pte_t *va, u32 pfn); + void (*alloc_pt)(u32 pfn); void (*alloc_pd)(u32 pfn); void (*alloc_pd_clone)(u32 pfn, u32 clonepfn, u32 start, u32 count); @@ -354,6 +356,8 @@ static inline void startup_ipi_hook(int #define __flush_tlb_global() paravirt_ops.flush_tlb_kernel() #define __flush_tlb_single(addr) paravirt_ops.flush_tlb_single(addr) +#define paravirt_map_pt_hook(type, va, pfn) paravirt_ops.map_pt_hook(type, va, pfn) + #define paravirt_alloc_pt(pfn) paravirt_ops.alloc_pt(pfn) #define paravirt_release_pt(pfn) paravirt_ops.release_pt(pfn) diff -r 87bf6b2d338d include/asm-i386/pgtable.h --- a/include/asm-i386/pgtable.hTue Feb 27 14:14:34 2007 -0800 +++ b/include/asm-i386/pgtable.hTue Feb 27 16:19:54 2007 -0800 @@ -263,6 +263,7 @@ static inline pte_t pte_mkhuge(pte_t pte */ #define pte_update(mm, addr, ptep) do { } while (0) #define pte_update_defer(mm, addr, ptep) do { } while (0) +#define paravirt_map_pt_hook(slot, va, pfn)do { } while (0) #endif /* @@ -469,10 +470,24 @@ extern pte_t *lookup_address(unsigned lo #endif #if defined(CONFIG_HIGHPTE) -#define 
pte_offset_map(dir, address) \ - ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address)) -#define pte_offset_map_nested(dir, address) \ - ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address)) +#define pte_offset_map(dir, address) \ +({ \ + pte_t *__ptep; \ + unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT; \ + __ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE0);\ + paravirt_map_pt_hook(KM_PTE0,__ptep, pfn); \ + __ptep
Re: + extend-print_symbol-capability.patch added to -mm tree
On Thu, 01 Mar 2007 18:17:56 -0800 [EMAIL PROTECTED] wrote:
> Today's print_symbol function dumps a kernel symbol with printk. This
> patch extends the functionality of kallsyms.c so that the symbol lookup
> function may be used without the printk. This is useful for modules that
> want to dump symbols elsewhere, for example, to debugfs. I intend to use
> the new function call in the GFS2 file system (which will be a separate
> patch).

Hey, I've needed this one in the past. Thanks.

> ---
>
> include/linux/kallsyms.h | 10 ++
> kernel/kallsyms.c | 21 ++---
> 2 files changed, 24 insertions(+), 7 deletions(-)
>
> diff -puN kernel/kallsyms.c~extend-print_symbol-capability kernel/kallsyms.c
> --- a/kernel/kallsyms.c~extend-print_symbol-capability
> +++ a/kernel/kallsyms.c
> @@ -288,6 +285,15 @@ void __print_symbol(const char *fmt, uns
> else
> sprintf(buffer, "%s+%#lx/%#lx", name, offset, size);
> }
> +}
> +
> +/* Replace "%s" in format with address, or returns -errno. */

Please fix the comment above...

> +void __print_symbol(const char *fmt, unsigned long address)
> +{
> + char buffer[KSYM_SYMBOL_LEN];
> +
> + lookup_symbol(address, buffer);
> +
> printk(fmt, buffer);
> }

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
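The split discussed above (one function that formats the symbol into a buffer, and a thin printing wrapper on top of it) can be sketched in userspace C. The symbol table, buffer size, and all function names here are hypothetical and exist only for illustration; only the "name+offset/size" format string is taken from the quoted hunk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SYM_BUF_LEN 128

struct sym {
    unsigned long start;
    unsigned long size;
    const char *name;
};

/* Hypothetical symbol table, for illustration only. */
static const struct sym table[] = {
    { 0x1000, 0x40, "do_thing" },
    { 0x1040, 0x20, "do_other" },
};

/* Format "name+offset/size" (or the raw address) into buf; prints nothing. */
void format_symbol(unsigned long addr, char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL && len > 0);
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (addr >= table[i].start &&
            addr - table[i].start < table[i].size) {
            snprintf(buf, len, "%s+%#lx/%#lx", table[i].name,
                     addr - table[i].start, table[i].size);
            return;
        }
    }
    snprintf(buf, len, "0x%lx", addr);
}

/* Thin wrapper: a printing caller reuses the lookup unchanged. */
void print_symbol_to(FILE *out, const char *prefix, unsigned long addr)
{
    char buf[SYM_BUF_LEN];

    format_symbol(addr, buf, sizeof(buf));
    fprintf(out, "%s%s\n", prefix, buf);
}

/* Convenience for callers that just want the string (not reentrant). */
const char *symbol_string(unsigned long addr)
{
    static char buf[SYM_BUF_LEN];

    format_symbol(addr, buf, sizeof(buf));
    return buf;
}
```

A caller that writes to a file other than the log can call format_symbol() directly, while print_symbol_to(stderr, "at ", addr) keeps the old print-style convenience.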
[PATCH 3/9] Vmi cpu cycles.patch
In order to share the common code in tsc.c which does CPU Khz calibration, we need to make an accurate value of CPU speed available to the tsc.c code. This value loses a lot of precision in a VM because of the timing differences with real hardware, but we need it to be as precise as possible so the guest can make accurate time calculations with the cycle counters. Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r b8b315c897bb arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.cTue Feb 27 14:04:43 2007 -0800 +++ b/arch/i386/kernel/vmi.cTue Feb 27 14:06:46 2007 -0800 @@ -874,6 +874,7 @@ static inline int __init activate_vmi(vo paravirt_ops.setup_secondary_clock = vmi_timer_setup_secondary_alarm; #endif paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles; + paravirt_ops.get_cpu_khz = vmi_cpu_khz; } if (!disable_noidle) para_fill(safe_halt, Halt); diff -r b8b315c897bb arch/i386/kernel/vmitime.c --- a/arch/i386/kernel/vmitime.cTue Feb 27 14:04:43 2007 -0800 +++ b/arch/i386/kernel/vmitime.cTue Feb 27 14:06:46 2007 -0800 @@ -177,6 +177,15 @@ unsigned long long vmi_get_sched_cycles( return read_available_cycles(); } +unsigned long vmi_cpu_khz(void) +{ + unsigned long long khz; + + khz = vmi_timer_ops.get_cycle_frequency(); + (void)do_div(khz, 1000); + return khz; +} + void __init vmi_time_init(void) { unsigned long long cycles_per_sec, cycles_per_msec; @@ -206,7 +215,6 @@ void __init vmi_time_init(void) (void)do_div(cycles_per_alarm, alarm_hz); cycles_per_msec = cycles_per_sec; (void)do_div(cycles_per_msec, 1000); - cpu_khz = cycles_per_msec; printk(KERN_WARNING "VMI timer cycles/sec = %llu ; cycles/jiffy = %llu ;" "cycles/alarm = %llu\n", cycles_per_sec, cycles_per_jiffy, diff -r b8b315c897bb include/asm-i386/vmi_time.h --- a/include/asm-i386/vmi_time.h Tue Feb 27 14:04:43 2007 -0800 +++ b/include/asm-i386/vmi_time.h Tue Feb 27 14:06:46 2007 -0800 @@ -50,6 +50,7 @@ extern unsigned long vmi_get_wallclock(v extern unsigned long vmi_get_wallclock(void); extern 
int vmi_set_wallclock(unsigned long now); extern unsigned long long vmi_get_sched_cycles(void); +extern unsigned long vmi_cpu_khz(void); #ifdef CONFIG_X86_LOCAL_APIC extern void __init vmi_timer_setup_boot_alarm(void); diff -r b8b315c897bb arch/i386/kernel/paravirt.c --- a/arch/i386/kernel/paravirt.c Tue Feb 27 14:04:43 2007 -0800 +++ b/arch/i386/kernel/paravirt.c Tue Feb 27 14:08:59 2007 -0800 @@ -522,6 +522,7 @@ struct paravirt_ops paravirt_ops = { .read_tsc = native_read_tsc, .read_pmc = native_read_pmc, .get_scheduled_cycles = native_read_tsc, + .get_cpu_khz = native_calculate_cpu_khz, .load_tr_desc = native_load_tr_desc, .set_ldt = native_set_ldt, .load_gdt = native_load_gdt, diff -r b8b315c897bb arch/i386/kernel/tsc.c --- a/arch/i386/kernel/tsc.cTue Feb 27 14:04:43 2007 -0800 +++ b/arch/i386/kernel/tsc.cTue Feb 27 14:09:23 2007 -0800 @@ -117,7 +117,7 @@ unsigned long long sched_clock(void) return cycles_2_ns(this_offset); } -static unsigned long calculate_cpu_khz(void) +unsigned long native_calculate_cpu_khz(void) { unsigned long long start, end; unsigned long count; diff -r b8b315c897bb include/asm-i386/paravirt.h --- a/include/asm-i386/paravirt.h Tue Feb 27 14:04:43 2007 -0800 +++ b/include/asm-i386/paravirt.h Tue Feb 27 14:10:25 2007 -0800 @@ -95,6 +95,7 @@ struct paravirt_ops u64 (*read_tsc)(void); u64 (*read_pmc)(void); u64 (*get_scheduled_cycles)(void); + unsigned long (*get_cpu_khz)(void); void (*load_tr_desc)(void); void (*load_gdt)(const struct Xgt_desc_struct *); @@ -275,6 +276,7 @@ static inline void halt(void) #define rdtscll(val) (val = paravirt_ops.read_tsc()) #define get_scheduled_cycles(val) (val = paravirt_ops.get_scheduled_cycles()) +#define calculate_cpu_khz() (paravirt_ops.get_cpu_khz()) #define write_tsc(val1,val2) wrmsr(0x10, val1, val2) diff -r b8b315c897bb include/asm-i386/timer.h --- a/include/asm-i386/timer.h Tue Feb 27 14:04:43 2007 -0800 +++ b/include/asm-i386/timer.h Tue Feb 27 14:11:35 2007 -0800 @@ -7,6 +7,7 @@ void 
setup_pit_timer(void); unsigned long long native_sched_clock(void); +unsigned long native_calculate_cpu_khz(void); /* Modifiers for buggy PIT handling */ extern int pit_latch_buggy; @@ -17,6 +18,7 @@ extern int recalibrate_cpu_khz(void); #ifndef CONFIG_PARAVIRT #define get_scheduled_cycles(val) rdtscll(val) +#define calculate_cpu_khz() native_calculate_cpu_khz() #endif #endif
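For readers following the patch, here is a minimal standalone sketch of the indirection it introduces: the kHz value is fetched through a function pointer, so a hypervisor backend can report its own cycle frequency instead of the native calibration. The names time_ops, fake_cycle_frequency, hv_cpu_khz and native_cpu_khz_stub are hypothetical stand-ins, not kernel symbols, and the 2.4 GHz figure is made up. The patch defines calculate_cpu_khz as a macro and divides with do_div(); this sketch uses a plain function and ordinary 64-bit division.

```c
#include <stdint.h>

/* Hypothetical stand-in for the ops table; only the one hook is modelled. */
struct time_ops {
    unsigned long (*get_cpu_khz)(void);
};

/* Hypothetical hypervisor-reported frequency in cycles per second. */
static uint64_t fake_cycle_frequency(void)
{
    return 2400000000ULL; /* 2.4 GHz, invented for the example */
}

/* Backend hook: convert Hz to kHz with an integer division. */
static unsigned long hv_cpu_khz(void)
{
    return (unsigned long)(fake_cycle_frequency() / 1000);
}

/* The native path would calibrate against a timer; here it is a fixed stub. */
static unsigned long native_cpu_khz_stub(void)
{
    return 1000000UL; /* pretend calibration found 1 GHz */
}

/* Default to the native hook, as the patch does for paravirt_ops. */
static struct time_ops ops = { .get_cpu_khz = native_cpu_khz_stub };

/* Callers never care which backend is installed. */
static unsigned long calculate_cpu_khz(void)
{
    return ops.get_cpu_khz();
}
```

Assigning ops.get_cpu_khz = hv_cpu_khz at backend activation time is all it takes to switch the source of the value; the calibration caller stays unchanged.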
[PATCH 1/9] Vmi timer fixes round two.patch
Critical bugfixes for the VMI-Timer code. 1) Do not set up a one-shot alarm if we are keeping the periodic alarm armed. Additionally, since the periodic alarm can be run at a lower rate than HZ, let's fix up the guard for the no-idle-hz mode appropriately. This fixes the bug where the no-idle-hz mode might have a higher interrupt rate than the non-idle case. 2) The interrupt handler can no longer adjust xtime because of nested lock acquisition. Drop this. We don't need to check for wallclock time at every tick; it can be done in userspace instead. 3) Add a bypass to disable noidle operation. This is useful as a last-minute workaround or testing measure. 4) The code to skip the IO_APIC timer testing (no_timer_check) should be conditional on IO_APIC, not SMP, since UP kernels can have this configured in as well. Signed-off-by: Dan Hecht <[EMAIL PROTECTED]> Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r f62ebe3ba01c arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.c Tue Feb 27 14:01:28 2007 -0800 +++ b/arch/i386/kernel/vmi.c Tue Feb 27 14:12:46 2007 -0800 @@ -54,6 +54,7 @@ static int disable_sep; static int disable_sep; static int disable_tsc; static int disable_mtrr; +static int disable_noidle; /* Cached VMI operations */ struct { @@ -255,7 +256,6 @@ static void vmi_nop(void) } /* For NO_IDLE_HZ, we stop the clock when halting the kernel */ -#ifdef CONFIG_NO_IDLE_HZ static fastcall void vmi_safe_halt(void) { int idle = vmi_stop_hz_timer(); @@ -266,7 +266,6 @@ static fastcall void vmi_safe_halt(void) local_irq_enable(); } } -#endif #ifdef CONFIG_DEBUG_PAGE_TYPE @@ -742,12 +741,7 @@ static inline int __init activate_vmi(vo (char *)paravirt_ops.save_fl); patch_offset(&irq_save_disable_callout[IRQ_PATCH_DISABLE], (char *)paravirt_ops.irq_disable); -#ifndef CONFIG_NO_IDLE_HZ - para_fill(safe_halt, Halt); -#else - vmi_ops.halt = vmi_get_function(VMI_CALL_Halt); - paravirt_ops.safe_halt = vmi_safe_halt; -#endif + para_fill(wbinvd, WBINVD); /* paravirt_ops.read_msr = 
vmi_rdmsr */ /* paravirt_ops.write_msr = vmi_wrmsr */ @@ -881,6 +875,12 @@ static inline int __init activate_vmi(vo #endif custom_sched_clock = vmi_sched_clock; } + if (!disable_noidle) + para_fill(safe_halt, Halt); + else { + vmi_ops.halt = vmi_get_function(VMI_CALL_Halt); + paravirt_ops.safe_halt = vmi_safe_halt; + } /* * Alternative instruction rewriting doesn't happen soon enough @@ -914,9 +914,11 @@ void __init vmi_init(void) local_irq_save(flags); activate_vmi(); -#ifdef CONFIG_SMP + +#ifdef CONFIG_X86_IO_APIC no_timer_check = 1; #endif + local_irq_restore(flags & X86_EFLAGS_IF); } @@ -942,7 +944,8 @@ static int __init parse_vmi(char *arg) } else if (!strcmp(arg, "disable_mtrr")) { clear_bit(X86_FEATURE_MTRR, boot_cpu_data.x86_capability); disable_mtrr = 1; - } + } else if (!strcmp(arg, "disable_noidle")) + disable_noidle = 1; return 0; } diff -r f62ebe3ba01c arch/i386/kernel/vmitime.c --- a/arch/i386/kernel/vmitime.cTue Feb 27 14:01:28 2007 -0800 +++ b/arch/i386/kernel/vmitime.cTue Feb 27 14:12:01 2007 -0800 @@ -276,15 +276,12 @@ static void vmi_account_real_cycles(unsi cycles_not_accounted = cur_real_cycles - real_cycles_accounted_system; while (cycles_not_accounted >= cycles_per_jiffy) { - /* systems wide jiffies and wallclock. */ + /* systems wide jiffies. 
*/ do_timer(1); cycles_not_accounted -= cycles_per_jiffy; real_cycles_accounted_system += cycles_per_jiffy; } - - if (vmi_timer_ops.wallclock_updated()) - update_xtime_from_wallclock(); write_sequnlock(&xtime_lock); } @@ -380,7 +377,6 @@ int vmi_stop_hz_timer(void) unsigned long seq, next; unsigned long long real_cycles_expiry; int cpu = smp_processor_id(); - int idle; BUG_ON(!irqs_disabled()); if (sysctl_hz_timer != 0) @@ -388,13 +384,13 @@ int vmi_stop_hz_timer(void) cpu_set(cpu, nohz_cpu_mask); smp_mb(); + if (rcu_needs_cpu(cpu) || local_softirq_pending() || - (next = next_timer_interrupt(), time_before_eq(next, jiffies))) { + (next = next_timer_interrupt(), +time_before_eq(next, jiffies + HZ/CONFIG_VMI_ALARM_HZ))) { cpu_clear(cpu, nohz_cpu_mask); - next = jiffies; - idle = 0; - } else - idle = 1; + return 0; + } /* Convert jiffies to the real cycle counter. */ do { @@ -404,17 +400,13 @@ int vmi_stop_hz_timer(void) } while (read_seqretry(&xtime_lock,
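The guard change in item 1 of the message above can be sketched in isolation. This is a simplified model, not the kernel function: should_stop_tick, tick_before_eq, and the HZ and ALARM_HZ values are assumptions chosen for illustration (ALARM_HZ stands in for CONFIG_VMI_ALARM_HZ). The idea it shows is that the tick is only stopped when the next timer lies further away than one periodic-alarm interval, using a wraparound-safe comparison in the style of time_before_eq().

```c
#include <stdbool.h>

#define HZ        1000UL  /* assumed tick rate */
#define ALARM_HZ  100UL   /* assumed periodic alarm rate */

/*
 * Wraparound-safe "a <= b" for free-running tick counters. Relies on the
 * usual two's-complement conversion from unsigned long to long.
 */
static bool tick_before_eq(unsigned long a, unsigned long b)
{
    return (long)(b - a) >= 0;
}

/*
 * Decide whether the periodic tick may be stopped while idle.
 * work_pending models the RCU and softirq checks in the real code.
 */
static bool should_stop_tick(unsigned long now, unsigned long next_timer,
                             bool work_pending)
{
    if (work_pending)
        return false;
    /* The periodic alarm would fire before or at the next timer anyway. */
    if (tick_before_eq(next_timer, now + HZ / ALARM_HZ))
        return false;
    return true;
}
```

With these values one alarm interval is 10 ticks, so a timer due 10 ticks from now keeps the periodic alarm, while one due 11 ticks from now allows the tick to stop.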
[PATCH 2/9] Sched clock paravirt op fix.patch
The custom_sched_clock hook is broken. The result from sched_clock needs to be in nanoseconds, not in CPU cycles. The TSC is insufficient for this purpose, because the TSC is poorly defined in a virtual environment and mostly represents real-world time instead of scheduled process time (which can be interrupted without notice when a virtual machine is descheduled). To make the scheduler consistent, we must expose a different notion of time, namely scheduled time. So deprecate this custom_sched_clock hack and turn it into a paravirt-op, as it should have been all along. This allows the tsc.c code which converts cycles to nanoseconds to be shared by all paravirt-ops backends. It is unfortunate to add a new paravirt-op, but this is a very distinct abstraction which is clearly different for all virtual machine implementations, and it gets rid of an ugly indirect function which I am ashamed to admit I hacked in to try to get this to work earlier, and then even got in the wrong units. Please apply. 
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]> diff -r d58e6ddfdfa9 arch/i386/kernel/paravirt.c --- a/arch/i386/kernel/paravirt.c Thu Feb 15 23:52:41 2007 -0800 +++ b/arch/i386/kernel/paravirt.c Fri Feb 16 00:04:39 2007 -0800 @@ -32,6 +32,7 @@ #include #include #include +#include /* nop stub */ static void native_nop(void) @@ -520,6 +521,7 @@ struct paravirt_ops paravirt_ops = { .write_msr = native_write_msr, .read_tsc = native_read_tsc, .read_pmc = native_read_pmc, + .get_scheduled_cycles = native_read_tsc, .load_tr_desc = native_load_tr_desc, .set_ldt = native_set_ldt, .load_gdt = native_load_gdt, diff -r d58e6ddfdfa9 arch/i386/kernel/tsc.c --- a/arch/i386/kernel/tsc.c Thu Feb 15 23:52:41 2007 -0800 +++ b/arch/i386/kernel/tsc.c Fri Feb 16 00:06:34 2007 -0800 @@ -14,6 +14,7 @@ #include #include #include +#include #include "mach_timer.h" @@ -108,9 +109,6 @@ unsigned long long sched_clock(void) { unsigned long long this_offset; - if (unlikely(custom_sched_clock)) - return (*custom_sched_clock)(); - /* * Fall back to jiffies if there's no TSC available: */ @@ -119,7 +117,7 @@ unsigned long long sched_clock(void) return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ); /* read the Time Stamp Counter: */ - rdtscll(this_offset); + get_scheduled_cycles(this_offset); /* return the value in ns */ return cycles_2_ns(this_offset); diff -r d58e6ddfdfa9 arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.c Thu Feb 15 23:52:41 2007 -0800 +++ b/arch/i386/kernel/vmi.c Fri Feb 16 00:02:48 2007 -0800 @@ -873,7 +873,7 @@ static inline int __init activate_vmi(vo paravirt_ops.setup_boot_clock = vmi_timer_setup_boot_alarm; paravirt_ops.setup_secondary_clock = vmi_timer_setup_secondary_alarm; #endif - custom_sched_clock = vmi_sched_clock; + paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles; } if (!disable_noidle) para_fill(safe_halt, Halt); diff -r d58e6ddfdfa9 arch/i386/kernel/vmitime.c --- a/arch/i386/kernel/vmitime.c Thu Feb 15 23:52:41 2007 -0800 +++ b/arch/i386/kernel/vmitime.c Fri Feb 
16 00:02:48 2007 -0800 @@ -172,7 +172,7 @@ int vmi_set_wallclock(unsigned long now) return -1; } -unsigned long long vmi_sched_clock(void) +unsigned long long vmi_get_sched_cycles(void) { return read_available_cycles(); } diff -r d58e6ddfdfa9 include/asm-i386/paravirt.h --- a/include/asm-i386/paravirt.h Thu Feb 15 23:52:41 2007 -0800 +++ b/include/asm-i386/paravirt.h Fri Feb 16 00:07:22 2007 -0800 @@ -94,6 +94,7 @@ struct paravirt_ops u64 (*read_tsc)(void); u64 (*read_pmc)(void); + u64 (*get_scheduled_cycles)(void); void (*load_tr_desc)(void); void (*load_gdt)(const struct Xgt_desc_struct *); @@ -273,6 +274,8 @@ static inline void halt(void) #define rdtscll(val) (val = paravirt_ops.read_tsc()) +#define get_scheduled_cycles(val) (val = paravirt_ops.get_scheduled_cycles()) + #define write_tsc(val1,val2) wrmsr(0x10, val1, val2) #define rdpmc(counter,low,high) do { \ diff -r d58e6ddfdfa9 include/asm-i386/time.h --- a/include/asm-i386/time.h Thu Feb 15 23:52:41 2007 -0800 +++ b/include/asm-i386/time.h Fri Feb 16 00:02:48 2007 -0800 @@ -30,7 +30,6 @@ static inline int native_set_wallclock(u #ifdef CONFIG_PARAVIRT #include -extern unsigned long long native_sched_clock(void); #else /* !CONFIG_PARAVIRT */ #define get_wallclock() native_get_wallclock() diff -r d58e6ddfdfa9 include/asm-i386/timer.h --- a/include/asm-i386/timer.h Thu Feb 15 23:52:41 2007 -0800 +++ b/include/asm-i386/timer.h Fri Feb 16 00:05:13 2007 -0800 @@ -4,13 +4,19 @@ #include #define TICK_SIZE (tick_nsec / 1000) + void setup_pit_timer(v
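The split this patch argues for, where a backend supplies raw scheduled cycles and shared code converts them to nanoseconds, might look roughly like the standalone sketch below. It is an illustration under assumptions, not the kernel source: clock_ops, ops_table, fake_scheduled_cycles and sched_clock_ns are hypothetical names, and the fixed-point conversion with a shift of 10 is written in the style of a cycles-to-ns helper rather than copied from tsc.c.

```c
#include <stdint.h>

#define SCALE_SHIFT 10  /* fixed-point precision of the scale factor */

static uint64_t cyc2ns_scale;

/* Precompute ns-per-cycle as a fixed-point value from the CPU speed. */
static void set_cyc2ns_scale(unsigned long cpu_khz)
{
    cyc2ns_scale = (1000000ULL << SCALE_SHIFT) / cpu_khz;
}

/* Shared conversion; the multiply can overflow for very large counts. */
static uint64_t cycles_2_ns(uint64_t cycles)
{
    return (cycles * cyc2ns_scale) >> SCALE_SHIFT;
}

/* Hypothetical backend hook: cycles during which the guest actually ran. */
static uint64_t fake_scheduled_cycles(void)
{
    return 3000000ULL; /* invented value for the example */
}

/* Hypothetical one-entry ops table standing in for the paravirt hook. */
struct clock_ops {
    uint64_t (*get_scheduled_cycles)(void);
};

static struct clock_ops ops_table = {
    .get_scheduled_cycles = fake_scheduled_cycles,
};

/* The scheduler clock always returns nanoseconds, whatever the backend. */
static uint64_t sched_clock_ns(void)
{
    return cycles_2_ns(ops_table.get_scheduled_cycles());
}
```

Because only the cycle source is behind the hook, the unit mistake described in the message (returning cycles where nanoseconds are expected) cannot recur in a backend: the conversion lives in one place.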
[PATCH 0/9] Bugfix patches for i386/vmi/paravirt-ops
Andi, Linus, we have some critical bugfixes for the VMI paravirt-ops code. Please apply. If there are objections to certain pieces, they can be reworked, but they are pretty much all needed for correctness. We are hoping to get these into the next 2.6.21-rc release. We had quite a few difficulties debugging after the integration of the hrtimers code, which is why this took so long. Andrew, I've added you to the list in case any further hrtimers integration issues pop up. Thanks, Zach
Is the clockevent resolution fine-grained enough?
It would appear the new clockevent API has a one-nanosecond resolution. That certainly looks fine-grained enough, but I'm afraid it's too coarse for some applications. In our application, we need periodic clock interrupts at about 100 kHz. If the period of the (programmable) frequency must be rounded to the nearest nanosecond, each interrupt can be off by up to 0.5 ns, which gives a cumulative error of up to 100,000 * 0.5 ns = 50 µs per second. We need to keep the cumulative error within, say, 1 ms/day, or about 11.6 ns/s. (The error is not measured against real time, but between different parts of our hardware that are run off of the same clock.) For our needs, we have built our own "clockevent" system that has a nominal one-femtosecond precision. The nanosecond resolution would be sufficient if there were a way to "nudge" the next interrupt by a nanosecond from the interrupt handler. Marko -- Marko Rauhamaa mailto:[EMAIL PROTECTED] http://pacujo.net/marko/
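To make the numbers concrete: 1 ms/day is 0.001 s / 86,400 s, or roughly 11.6 ns of allowed drift per second, while a fixed 0.5 ns rounding error at 100,000 interrupts per second adds up to 50 µs per second, several thousand times over budget. One way to read the "nudge" idea is a fractional accumulator that carries the rounding remainder from one interval to the next, so that the total error stays below 1 ns no matter how long the timer runs. The sketch below is only an illustration of that idea; frac_timer and its functions are hypothetical names and are not part of the clockevent API.

```c
#include <stdint.h>

/* Period of 1e9/freq ns split into a whole part and a fraction rem_num/den. */
struct frac_timer {
    uint64_t base_ns;  /* whole nanoseconds per period (rounded down) */
    uint64_t rem_num;  /* numerator of the fractional nanosecond */
    uint64_t den;      /* denominator, equal to the frequency in Hz */
    uint64_t acc;      /* accumulated fraction, always below den */
};

static void frac_timer_init(struct frac_timer *t, uint64_t freq_hz)
{
    t->base_ns = 1000000000ULL / freq_hz;
    t->rem_num = 1000000000ULL % freq_hz;
    t->den = freq_hz;
    t->acc = 0;
}

/* Length of the next interval: nudged by 1 ns whenever the fraction carries. */
static uint64_t frac_timer_next_interval(struct frac_timer *t)
{
    uint64_t interval = t->base_ns;

    t->acc += t->rem_num;
    if (t->acc >= t->den) {
        t->acc -= t->den;
        interval += 1;
    }
    return interval;
}

/* Helper for checking drift: total time covered by count intervals. */
static uint64_t frac_timer_total_ns(uint64_t freq_hz, uint64_t count)
{
    struct frac_timer t;
    uint64_t total = 0;

    frac_timer_init(&t, freq_hz);
    for (uint64_t i = 0; i < count; i++)
        total += frac_timer_next_interval(&t);
    return total;
}
```

For example, at 96 kHz the ideal period is 10416 2/3 ns; the sketch alternates between 10416 ns and 10417 ns intervals so that 96,000 of them add up to exactly one second.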
Re: 2.6.21-rc2 radeon backlight
On Wed, 28 Feb 2007 08:32:43 -0800 Alex Romosan <[EMAIL PROTECTED]> wrote:

> the backlight on my thinkpad still (2.6.20 worked fine) doesn't come
> on if i have the radeon backlight enabled. without it, i guess it's
> the ibm acpi modules that controls the backlight and it seems to work
> fine.

Unclear. Are you saying that the backlight comes on OK if you use the IBM acpi module?