Kernel traces coming back with trash/clutter
I am experimenting with the kernel (CentOSv4.4 x86_64, 2.6.9-42.0.10) and I have added a number of traces in some relatively sensitive code in the page cache and some I/O functions. I am getting this odd content in the trace log (dmesg), and I cannot figure out what it is or why it is there.

4296757675 pdflush(80): do_writepages: mapopswrtpgs a0195ff5
4296757675 pdflush(80): mpage_writepages w/b index 49728 pages 256000
<7> <7> <7> <7> <7>__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max 802525d8 generic_make_request(bio 01017c745300) 50729472, 704 __make_request(q 0101b9293870, bio 01017c745300: sdc; 50729600, 704) ll_new_hw_segment: 70 + 29 88
<7> <7> <7> <7> __bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max 802525d8 generic_make_request(bio 01017c745a80) 50730176, 704 __make_request(q 0101b9293870, bio 01017c745a80: sdc; 50730304, 704)
4296757684 swapper(0): dl_mv2dsp: sdc start 50710368 secs 1408

(The lines with the <7>s in them are long - I wrapped them for ease of reading and to keep the width down somewhat.)

Any feedback that might illuminate this would be welcome.

Please CC me personally as I am not yet able to subscribe to this list (apologies).

Thanks.

--
Mark Hull-Richter, Linux Kernel Engineer
DATAllegro (www.datallegro.com)
85 Enterprise, Second Floor, Aliso Viejo, CA 92656
949-680-3082 - Office 949-330-7691 - fax
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
On Tue, 24 Apr 2007, Pavel Machek wrote:
> > If the code just moved somewhere else, it's not less code.
>
> It is not just moved. It is in userspace, where we can use liblzf / gcrypt / (and vbetool for s2ram/s2both) as libraries. We have about 7000 LoC of userland code (that is not libraries).

If it's in user land, we also have

 - communication difficulties between two parts, and all the *crap* that tends to entail (ie legacy interfaces forever, and upgrading one without the other etc)

 - people who work on the kernel part are working blind (ie they are at the mercy of whatever userland does, and it's not a contained subsystem).

This just ends up becoming worse when you then interact with ten different versions of the user-land stuff, thanks to small tweaks by five different vendors, and a hundred random people.

And don't tell me that doesn't happen. Maybe it doesn't happen _now_, because people who use it all get the patches from one place, but the moment we start talking about integration into the standard kernel, that means that the kernel needs to work regardless of whether somebody uses SuSE, RH, Fedora, Ubuntu or cooked his own distro entirely using some development version of the suspend user-space tools.

This is why I don't believe in the whole kernel-line-counting thing. I'm personally 100% convinced that it's better to have ten times as many lines in the kernel, if it means that you can just forget about version skew and bad user-space interfaces etc.

So if you want to enumerate good points, you'd damn well also face the _problems_. This is why there's a lot to be said for

    echo mem > /sys/power/state

and being able to follow the path through _one_ object (the kernel) over trying to figure out the interaction between many different parts with different versions.

> I believe uswsusp user/kernel separation is clean enough. Kernel provides snapshot image and resume image. (Thanks go to Rafael for very clean interface).

Now, *that* is the kind of argument that matters.

Quite frankly, if you want to convince me, it's not by lines of kernel code, but by talking about easy-to-understand interfaces that actually do one thing and do it well (and by one thing, I mean one _whole_ thing).

Because I care a lot less about lines of code than about maintainable interfaces that people can think about and debug.

I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the whole thing. I think they've _all_ caused problems for the true suspend (suspend-to-ram), and the last thing I want to see is three or four different suspend-to-disk implementations.

So unlike Ingo, I don't think "let's just integrate them all side-by-side and maintain them and look who wins" is really a good idea.

How many different magic ioctl's does the thing introduce? Is it really just *two* entry-points (and how simple are they, interface-wise), and nothing else?

    Linus
Re: old ISA DMA bug in 2.6.12?
Bob Tracy wrote:
> I was enjoying yet another session of beating my head against the wall trying to do useful things with old hardware :-), and managed to cause a kernel panic by simply trying to mount a cdrom in the context of a DSL-N installation.
>
> The SCSI host adapter is an Adaptec AHA-1542B, and when I try to mount a cdrom, I manage to run afoul of the BAD_DMA() check in aha1542.c: the buffer returned is not in the lower 16 MB of memory. The same 2.6.12 kernel + hardware combination works fine as long as I confine my I/O to the hard disk that's also attached to the AHA-1542B.

Looks like the aha1542 driver doesn't set the DMA mask, so the kernel will default to thinking it can do 32-bit DMA when it should be 24-bit.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
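
To make the 24-bit versus 32-bit point concrete, here is a minimal userspace sketch of the arithmetic behind a DMA address limit. It is not the aha1542 code and not a kernel API: `dma_mask` and `buffer_fits_mask` are hypothetical helpers, and the only assumption taken from the thread is that a 24-bit device can reach just the lower 16 MB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: address mask for a device with `bits` address lines. */
static uint64_t dma_mask(unsigned int bits)
{
    assert(bits > 0 && bits <= 64);
    return (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* Hypothetical helper: true when every byte of [addr, addr + len) is
 * reachable under a mask of the given width. */
static bool buffer_fits_mask(uint64_t addr, uint64_t len, unsigned int bits)
{
    uint64_t last;

    if (len == 0)
        return addr <= dma_mask(bits);

    last = addr + (len - 1);
    if (last < addr)            /* address arithmetic wrapped around */
        return false;

    return last <= dma_mask(bits);
}
```

With a 24-bit mask the limit is 0x00ffffff, so a buffer starting at 16 MB (0x01000000) is rejected, while the same buffer passes under a 32-bit mask. That is the mismatch described above: the kernel hands out a buffer that is fine for a 32-bit device but out of reach for this one.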
Re: Kernel traces coming back with trash/clutter
> I am experimenting with the kernel (CentOSv4.4 x86_64, 2.6.9-42.0.10) and I have added a number of traces in some relatively sensitive code in the page cache and some I/O functions. I am getting this odd content in the trace log (dmesg), and I cannot figure out what it is or why it is there.
>
> 4296757675 pdflush(80): do_writepages: mapopswrtpgs a0195ff5
> 4296757675 pdflush(80): mpage_writepages w/b index 49728 pages 256000
> <7> <7> <7> <7> <7>__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max 802525d8 generic_make_request(bio 01017c745300) 50729472, 704 __make_request(q 0101b9293870, bio 01017c745300: sdc; 50729600, 704) ll_new_hw_segment: 70 + 29 88
> <7> <7> <7> <7> __bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max 802525d8 generic_make_request(bio 01017c745a80) 50730176, 704 __make_request(q 0101b9293870, bio 01017c745a80: sdc; 50730304, 704)
> 4296757684 swapper(0): dl_mv2dsp: sdc start 50710368 secs 1408
>
> (The lines with the <7>s in them are long - I wrapped them for ease of reading and to keep the width down somewhat.)
>
> Any feedback that might illuminate this would be welcome.
>
> Please CC me personally as I am not yet able to subscribe to this list (apologies).

<7> is KERN_DEBUG in include/linux/kernel.h, used with printk. Are you using printk in the following forms?

    printk(KERN_DEBUG "A debug message.\n");

...or...

    const char msg_debug[] = KERN_DEBUG "A debug message.\n";
    printk(msg_debug);

Perhaps you have something looping that's outputting KERN_DEBUG with a null message? Or one of your diagnostic printk statements includes KERN_DEBUG with no actual message?

Remember, if you have a string in a variable without a KERN_* prefix, you can do this:

    printk(KERN_DEBUG "%s\n", debug_message);
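
One way to picture how a literal level tag can end up in the middle of a log line is the following minimal userspace model. It assumes (this is not stated in the thread) a logger that strips a level tag only at the start of a line, so a tagged message issued after one that did not end in a newline keeps its tag as plain text. `model_log`, `model_printk` and the buffer size are hypothetical names for illustration; `KERN_DEBUG` is redefined locally to the "<7>" tag it expands to in kernels of this era.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KERN_DEBUG "<7>"   /* local stand-in for the kernel macro */

/* Hypothetical model of a log buffer, not the kernel's data structure. */
struct model_log {
    char buf[512];
    size_t len;
    int at_line_start;      /* 1 when the next character starts a new line */
};

static void model_log_init(struct model_log *log)
{
    assert(log != NULL);
    log->len = 0;
    log->buf[0] = '\0';
    log->at_line_start = 1;
}

/* Append msg; a "<N>" tag is consumed only at the start of a line. */
static void model_printk(struct model_log *log, const char *msg)
{
    const char *p = msg;

    while (*p && log->len + 1 < sizeof(log->buf)) {
        if (log->at_line_start) {
            if (p[0] == '<' && p[1] >= '0' && p[1] <= '7' && p[2] == '>')
                p += 3;                 /* tag recognised and dropped */
            log->at_line_start = 0;
            if (!*p)
                break;
        }
        log->buf[log->len++] = *p;
        if (*p == '\n')
            log->at_line_start = 1;
        p++;
    }
    log->buf[log->len] = '\0';
}
```

In this model, logging `KERN_DEBUG "first "` (no trailing newline) followed by `KERN_DEBUG "second\n"` leaves the second tag embedded in the output line, while ending every message with `\n` keeps the output clean. Whether that is what happens in the traces above would need to be checked against the actual printk calls.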
Re: [PATCH] use mutex instead of semaphore in RocketPort driver
Matthias Kaehlcke wrote:
> On Tue, Apr 24, 2007 at 07:53:04PM +0200, Oliver Neukum wrote:
> > On Tuesday, 24 April 2007 19:49, Matthias Kaehlcke wrote:
> > > @@ -1706,7 +1706,7 @@ static int rp_write(struct tty_struct *tty,
> > >   if (count <= 0 || rocket_paranoia_check(info, "rp_write"))
> > >           return 0;
> > >
> > > - down_interruptible(&info->write_sem);
> > > + mutex_lock_interruptible(&info->write_mtx);
> >
> > This is a bug. It is also present in the current code, but nevertheless it is a bug. If you use an interruptible lock, you must be ready to deal with interrupts, which are ignored by this code.
>
> i fear i don't have the experience/knowledge to fix this bug, thanks for your remark.
>
> i'm a bit confused now about the interruptible locks, i thought using them means that the process will be woken up when receiving a signal. what role do interrupts play when using interruptible locks?

You are correct, interrupts aren't involved. However, if the wait is interrupted by a signal, mutex_lock_interruptible will return a nonzero return code which needs to be checked for (and likely -ERESTARTSYS or -EINTR returned), otherwise the code will blindly continue as though it has locked the mutex even though it has not.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
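
The pattern described above can be sketched in userspace as follows. This is not the RocketPort driver and not the kernel mutex API: `fake_mutex`, `fake_mutex_lock_interruptible`, `fake_mutex_unlock` and `write_sketch` are hypothetical stand-ins, and `ERESTARTSYS` is defined locally because it is a kernel-internal error code that userspace headers do not normally provide.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512    /* kernel-internal value, defined here for the sketch */
#endif

/* Hypothetical stand-in for a mutex whose wait can be cut short by a signal. */
struct fake_mutex {
    int locked;
    int signal_pending;     /* simulates a signal arriving during the wait */
};

/* Returns 0 when the lock was taken, a negative value when interrupted. */
static int fake_mutex_lock_interruptible(struct fake_mutex *m)
{
    assert(m != NULL);
    if (m->signal_pending)
        return -EINTR;      /* lock NOT taken */
    m->locked = 1;
    return 0;
}

static void fake_mutex_unlock(struct fake_mutex *m)
{
    m->locked = 0;
}

/* Write path that checks the lock result before touching shared state. */
static int write_sketch(struct fake_mutex *m, int count)
{
    int written;

    if (count <= 0)
        return 0;

    if (fake_mutex_lock_interruptible(m))
        return -ERESTARTSYS;    /* bail out: the mutex is not held */

    written = count;            /* placeholder for the critical section */
    fake_mutex_unlock(m);
    return written;
}
```

The important line is the early return: without it, the critical section and the unlock would both run on a mutex that was never acquired.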
Re: [REPORT] cfs-v4 vs sd-0.44
Rogan Dawes wrote:
> Chris Friesen wrote:
> > Rogan Dawes wrote:
> > > I guess my point was if we somehow get to an odd number of nanoseconds, we'd end up with rounding errors. I'm not sure if your algorithm will ever allow that.
> >
> > And Ingo's point was that when it takes thousands of nanoseconds for a single context switch, an error of half a nanosecond is down in the noise.
> >
> > Chris
>
> My concern was that since Ingo said that this is a closed economy, with a fixed sum/total, if we lose a nanosecond here and there, eventually we'll lose them all.
>
> Some folks have uptimes of multiple years.
>
> Of course, I could (very likely!) be full of it! ;-)

And won't be using any new scheduler on these computers anyhow, as that would involve bringing the system down to install the new kernel. :-)

Peter
--
Peter Williams [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
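
The rounding concern can be illustrated with plain integer arithmetic. The sketch below is generic and makes no claim about how CFS or SD actually account time; `split_truncating` and `split_conserving` are hypothetical helpers that divide a nanosecond budget between two parties.

```c
#include <assert.h>
#include <stdint.h>

struct split {
    uint64_t first;
    uint64_t second;
};

/* Both halves rounded down: an odd total silently loses 1 ns. */
static struct split split_truncating(uint64_t total_ns)
{
    struct split s = { total_ns / 2, total_ns / 2 };
    return s;
}

/* The remainder is handed to the second share, so the sum is preserved. */
static struct split split_conserving(uint64_t total_ns)
{
    struct split s = { total_ns / 2, total_ns - total_ns / 2 };
    return s;
}
```

In a closed economy the second form is the one that keeps the total fixed. With the first form, each odd split leaks one nanosecond, which is the "lose them all eventually" worry; it is also tiny next to the thousands of nanoseconds a context switch costs, which is the counterpoint made above.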
Re: Kernel traces coming back with trash/clutter
On 4/24/07, John Anthony Kazos Jr. [EMAIL PROTECTED] wrote:
> > I am getting this odd content in the trace log (dmesg), and I cannot figure out what it is or why it is there.
> >
> > <7> <7> <7> <7> <7>__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max 802525d8 generic_make_request(bio 01017c745300) 50729472, 704
>
> <7> is KERN_DEBUG in include/linux/kernel.h, used with printk. Are you using printk in the following forms?
>
>     printk(KERN_DEBUG "A debug message.\n");

Yes, exclusively.

> Perhaps you have something looping that's outputting KERN_DEBUG with a null message? Or one of your diagnostic printk statements includes KERN_DEBUG with no actual message?

No, they are all KERN_DEBUG<space>"some string here", almost all with some formatted output as well.

Could I be overloading the printk output buffer, as in possibly too tightly repeated/looped code to be able to output it all?

> Remember, if you have a string in a variable without a KERN_* prefix, you can do this:
>
>     printk(KERN_DEBUG "%s\n", debug_message);

Haven't tried that one - they're all of the form above.

Thanks again.

--
Mark Hull-Richter, Linux Kernel Engineer
DATAllegro (www.datallegro.com)
85 Enterprise, Second Floor, Aliso Viejo, CA 92656
949-680-3082 - Office 949-330-7691 - fax
Re: [RFC] [PATCH] cpufreq: allow full selection of default governors
On Tue, Apr 24, 2007 at 03:05:36PM -0700, Nish Aravamudan wrote:
> On 4/24/07, Dave Jones [EMAIL PROTECTED] wrote:
> > On Tue, Apr 24, 2007 at 09:03:23PM +, William Heimbigner wrote:
> > > The following patches should allow selection of conservative, powersave, and ondemand in the kernel configuration.
> >
> > This has been rejected several times already. Ondemand and conservative aren't viable governors for all cpufreq implementations (ie, ones with high switching latencies).
>
> This piques my curiosity -- some governors don't work with some cpufreq implementations. Are those implementations in the kernel or in userspace? If in the kernel, then perhaps there should be some dependency expressed there in Kconfig between cpufreq implementation and the available governors

It can't be solved that easily. powernow-k8, for example, is fine to use with ondemand on newer systems, where the latency is low. On older models, however, it isn't.

> > Also, see the comment in the Kconfig a few lines above where you are adding this.
>
> Are these governors unfixable? If

tbh, I've forgotten the original issues that caused the comment to be placed there. Dominik ?

> Just looking for more info -- feel free to just point me at the archives.

cpufreq-list archives are at http://lists.linux.org.uk/mailman/listinfo/cpufreq (though only available to list members)

    Dave

--
http://www.codemonkey.org.uk
Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)
On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote:
> The following extra security measures are taken for unprivileged mounts:
>
>  - usermounts are limited by a sysctl tunable
>  - force nosuid,nodev mount options on the created mount

The original userspace user= solution also implies the noexec option by default (you can override the default with the exec option).

It means the kernel based solution is not fully compatible ;-(

    Karel

--
Karel Zak [EMAIL PROTECTED]
Red Hat Czech s.r.o. Purkynova 99/71, 612 45 Brno, Czech Republic
Reg.id: CZ27690016
Re: Question about Reiser4
On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper [EMAIL PROTECTED] said:
> I did. That whole thread is some guy spouting off a ludicrous Bonnie++ benchmark showing that compressing long strings of 0s results in things taking up very little space and being very fast.

I think you are deliberately being stupid here.

You are claiming that REISER4's good speed results when using compression actually have a simple explanation and THEREFORE all good results for the filesystem, even those results that have nothing to do with compression, are negated.

NOTHING COULD BE FURTHER FROM THE TRUTH.

Your conclusion is a total travesty of logic.

As I understand it, the default Reiser4 DOES NOT USE any compression at all, not even tail compression, but saves space by eliminating block alignment wastage (tail compression is an option).

So let's LOSE the statistics that involve compression. The results now look like this:

.------------------------------.
| FILESYSTEM |  TIME  |  DISK  |
| TYPE       | (secs) | USAGE  |
.------------------------------.
| REISER4    |   3462 |    692 |
| EXT2       |   4092 |    816 |
| JFS        |   4225 |    806 |
| EXT4       |   4408 |    816 |
| EXT3       |   4421 |    816 |
| XFS        |   4625 |    779 |
| REISER3    |   6178 |    793 |
| FAT32      |  12342 |    988 |
| NTFS-3g    |  10414 |    772 |
.------------------------------.

These results are still EXTREMELY GOOD for REISER4. These results still say that Reiser4 is a truly remarkable filesystem, as stated in:

http://linuxhelp.150m.com/resources/fs-benchmarks.htm
http://m.domaindlx.com/LinuxHelp/fs-benchmarks.htm

So why do I see an anti-Reiser religion in all that you people say?

You concentrate on the fact that bonnie++'s use of files that are mainly zeroes will make the results using compression less good than they are. I can't see anywhere where this has been denied. In fact the other set of statistics, which you just ignore, states that in more realistic situations the compression speedup is slightly negative.

What is wrong here is:

You say that the Bonnie++ tests using compression are subject to interpretation. No argument here.

You ignore the tests that confirm your statement.

You are clearly not interested in the actual results or their interpretation.

You, by some incredibly twisted logic, then state that Reiser4 is therefore not good, even though it is clearly the best filesystem when NOT using compression.

This of course is completely deceitful logic.

That the speed advantage from compression would be small is clear from the OTHER data that you ignore, namely:

.-------------------------------------------------------.
| File         | Disk | Copy | Copy | Tar  |Unzip | Del |
| System       |Usage |655MB |655MB | Gzip |UnTar | 2.5 |
| Type         | (MB) | (1)  | (2)  |655MB |655MB | Gig |
.-------------------------------------------------------.
| REISER4 lzo  |  278 |  138 |   56 |   80 |   34 |  84 |
| REISER4 gzip |  213 |  148 |   68 |   83 |   48 |  70 |
| REISER4      |  692 |  148 |   55 |   67 |   25 |  56 |
| EXT4         |  816 |  174 |   70 |   74 |   42 |  50 |
.-------------------------------------------------------.

So, the speed increase with compression (on very compressible kernel sources) is slightly negative, but the speed is still comparable to that of EXT4.

On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper [EMAIL PROTECTED] said:
> > > I know that this whole effort has been put in disarray by the prosecution of Hans Reiser, but I'm curious as to its status. Is Reiser4 going to be going into the Linus kernel anytime soon? Is there somewhere I should be looking to find this out without wasting bandwidth here?
> >
> > There was a thread the other day, that talked about Reiser4. It took a while but I have found it (actually two)
> >
> > http://lkml.org/lkml/2007/4/5/360
> > http://lkml.org/lkml/2007/4/9/4
> >
> > You may want to check them out.
>
> I did. That whole thread is some guy spouting off a ludicrous Bonnie++ benchmark showing that compressing long strings of 0s results in things taking up very little space and being very fast.
>
> Such things will produce lots of flames and no useful information whatsoever as is evinced by the half conspiracy theory, half truth the thread degenerated into in the second message you linked to.

--
[EMAIL PROTECTED]
-- http://www.fastmail.fm - mmm... Fastmail...
Re: regression with gammu on 2.6.21-rc7
On Mon, Apr 23, 2007 at 10:10:22AM +0200, Wolfgang Erig wrote:
> Hello Greg,

Please don't take me out of the cc:, otherwise I might miss this (as I did...)

> On Sun, Apr 22, 2007 at 10:47:17PM -0700, Greg KH wrote:
> > On Fri, Apr 20, 2007 at 10:58:53AM +0200, Wolfgang Erig wrote:
> > > Hello,
> > >
> > > I have a regression with 2.6.21-rc7-g80d74d51. The utility gammu to talk to my mobile does not work anymore. With 2.6.20 gammu runs fine. Distribution is the latest Debian/testing
> > >
> > > Wolfgang
> > >
> > > $ gammu --backup backup
> > > Press Ctrl+C to break...
> > > I/O possible
>
> the problem is here because gammu stops working. Maybe a problem in gammu, but with 2.6.20 gammu works fine.
>
> > > $ uname -a
> > > Linux max 2.6.21-rc7-g80d74d51 #9 SMP Wed Apr 18 21:41:41 CEST 2007 i686 GNU/Linux
> > >
> > > $ tail messages
> > > Apr 20 08:04:36 max kernel: ACPI: PCI Interrupt :00:1b.0[A] -> GSI 16 (level, low) -> IRQ 16
> > > Apr 20 08:04:36 max kernel: extern: link up, 100Mbps, full-duplex, lpa 0x45E1
> > > Apr 20 08:04:36 max kernel: intern: setting half-duplex.
> > > Apr 20 08:09:02 max kernel: usb 2-2: USB disconnect, address 3
> > > Apr 20 08:09:02 max kernel: pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0
> > > Apr 20 08:09:02 max kernel: pl2303 2-2:1.0: device disconnected
> > > Apr 20 08:10:24 max kernel: usb 2-2: new full speed USB device using uhci_hcd and address 4
> > > Apr 20 08:10:25 max kernel: usb 2-2: configuration #1 chosen from 1 choice
> > > Apr 20 08:10:25 max kernel: pl2303 2-2:1.0: pl2303 converter detected
> > > Apr 20 08:10:25 max kernel: usb 2-2: pl2303 converter now attached to ttyUSB0
> >
> > That looks ok, I'm guessing you yanked it out and then back in?
>
> Yes. This is included only to see which device is connected.
>
> > Or is the problem that the device was removed?
>
> No, no problem with removal. I see no hint for a problem in the usb-layer.

I don't see any problems here. If you enable debugging in the pl2303 driver, do you get any errors?

You can do this by:

    modprobe pl2303 debug=1

or if the module is built in or already loaded:

    echo 1 > /sys/module/pl2303/parameters/debug

Also, if you know how to use git, doing a 'git bisect' to try to track down the problem commit would be very helpful.

thanks,

greg k-h
Re: [1/3] 2.6.21-rc7: known regressions (v2)
On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote:
> On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
> > On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:
> > > This email lists some known regressions in Linus' tree compared to 2.6.20.
> > >
> > > If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affected subsystem or driver, a patch of yours caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues.
> > >
> > > Due to the huge amount of recipients, please trim the Cc when answering.
> > >
> > > Subject    : gammu no longer works
> > > References : http://lkml.org/lkml/2007/4/20/84
> > > Submitter  : Wolfgang Erig [EMAIL PROTECTED]
> > > Status     : unknown
> >
> > I've asked for more information about this, and so far am not sure it's a real problem.
>
> It is a real problem for me. I tried this on 2 different boxes with the same behaviour. No sync between my Nokia mobile and Linux with the latest kernel :(

Sorry, I didn't see your response, have followed up on lkml now.

thanks,

greg k-h
ChunkFS - measuring cross-chunk references
On 4/24/07, Theodore Tso [EMAIL PROTECTED] wrote:
> On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
> > .
>
> It would also be good to distinguish between directories referencing files in another chunk, and directories referencing subdirectories in another chunk (which would be simpler to handle, given the topological restrictions on directories, as compared to files and hard links).

Modified the tool to distinguish between

1. cross references between directories and files
2. cross references between directories and sub directories
3. cross references within a file (due to huge file size)

Below is the result from the / partition of an ext3 file system:

Number of files = 221794
Number of directories = 24457
Total size = 8193116 KB
Total data stored = 7187392 KB
Size of block groups = 131072 KB
Number of inodes per block group = 16288
No. of cross references between directories and sub-directories = 7791
No. of cross references between directories and file = 657
Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)

Thanks for the suggestions.

> There may also be special things we will need to do to handle scenarios such as BackupPC, where if it looks like a directory contains a huge number of hard links to a particular chunk, we'll need to make sure that directory is either created in the right chunk (possibly with hints from the application) or migrated to the right chunk (but this might cause the inode number of the directory to change --- maybe we allow this as long as the directory has never been stat'ed, so that the inode number has never been observed).
>
> The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow it on 64-bit systems, or we need to consider a migration so that even on 32-bit platforms, stat() behaves like stat64(), insofar as it uses a stat structure which returns a 64-bit ino_t.
>
> - Ted

Thanks,
Karuna

cref.tar.bz2 Description: BZip2 compressed data
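
As a rough illustration of what counting a cross reference involves, here is a minimal sketch. It assumes, since the thread does not show the tool's source, that a chunk is modelled as an ext3 block group and that a reference is "cross" when the directory's inode and the child's inode fall in different groups. `inode_to_group` and `is_cross_reference` are hypothetical names; the group formula is the usual ext2/ext3 one, with inode numbers starting at 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Block group that holds a given inode (inode numbers start at 1). */
static uint32_t inode_to_group(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group > 0);
    return (ino - 1) / inodes_per_group;
}

/* Assumed definition: a directory entry is a cross reference when the
 * directory and the entry it names live in different groups. */
static bool is_cross_reference(uint32_t dir_ino, uint32_t child_ino,
                               uint32_t inodes_per_group)
{
    return inode_to_group(dir_ino, inodes_per_group) !=
           inode_to_group(child_ino, inodes_per_group);
}
```

With the 16288 inodes per group reported above, inodes 1 through 16288 land in group 0 and inode 16289 is the first one in group 1. The reported totals are also self-consistent: 7791 + 657 = 8448 directory references, and 8448 + 53570 = 62018 in all.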
Re: [1/3] 2.6.21-rc7: known regressions (v2)
On Tue, Apr 24, 2007 at 05:14:28PM -0700, Greg KH wrote:
> On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote:
> > On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote:
> > > On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote:
> > > > This email lists some known regressions in Linus' tree compared to 2.6.20.
> > > >
> > > > If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affected subsystem or driver, a patch of yours caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues.
> > > >
> > > > Due to the huge amount of recipients, please trim the Cc when answering.
> > > >
> > > > Subject    : gammu no longer works
> > > > References : http://lkml.org/lkml/2007/4/20/84
> > > > Submitter  : Wolfgang Erig [EMAIL PROTECTED]
> > > > Status     : unknown
> > >
> > > I've asked for more information about this, and so far am not sure it's a real problem.
> >
> > It is a real problem for me. I tried this on 2 different boxes with the same behaviour. No sync between my Nokia mobile and Linux with the latest kernel :(
>
> Sorry, I didn't see your response, have followed up on lkml now.

It turned out this was actually a bug in Gammu that will be fixed in the next release of Gammu.

> thanks,
>
> greg k-h

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
    Pearl S. Buck - Dragon Seed
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Willy Tarreau wrote:
> On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
> > On Tuesday 24 April 2007, Ingo Molnar wrote:
> > > * David Lang [EMAIL PROTECTED] wrote:
> > > > > (Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.)
> > > >
> > > > if you are trying to unwedge a system it may be a good idea to renice all tasks to 0, it could be that a task at +19 is holding a lock that something else is waiting for.
> > >
> > > Yeah, that's possible too, but +19 tasks are getting a small but guaranteed share of the CPU so eventually it ought to release it. It's still a possibility, but i think i'll wait for a specific incident to happen first, and then react to that incident :-)
> > >
> > > Ingo
> >
> > In the instance I created, even the SysRq+b was ignored, and ISTR that's supposed to initiate a reboot, is it not? So it was well and truly wedged.
>
> On many machines I use this on, I have to release Alt while still holding B. Don't know why, but it works like this.
>
> Willy

Yeah, Willy, and pardon a slight bit of sarcasm here but that's how we get the reputation for needing virgins to sacrifice, regular experienced girls just wouldn't do.

This isn't APL running on an IBM 5120, so it should Just Work(TM) and not need a seance or something to conjure up the right spell. Besides, the reset button is only about 6 feet away... I get some exercise that way by getting up to push it. :)

--
Cheers, Gene
"There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
It is so soon that I am done for, I wonder what I was begun for.
    -- Epitaph, Cheltenham Churchyard
Re: [00/17] Large Blocksize Support V3
FWIW, this would also let zisofs remove the ugly hacks we currently employ to deal with compression blocks.

    -hpa
Re: [patch 1/7] libata: check for AN support
On Tue, Apr 24, 2007 at 01:53:27PM -0700, Kristen Carlson Accardi wrote:
> Check to see if an ATAPI device supports Asynchronous Notification. If so, enable it.
>
> changes from last version:
> * fix typo in ata_id_has_AN and make word 76 test more clear
> * If we fail to set the AN feature, just print a warning and continue
>
> Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED]
>
> @@ -299,6 +305,8 @@ struct ata_taskfile {
>  #define ata_id_queue_depth(id)  (((id)[75] & 0x1f) + 1)
>  #define ata_id_removeable(id)   ((id)[0] & (1 << 7))
>  #define ata_id_has_dword_io(id) ((id)[50] & (1 << 0))
> +#define ata_id_has_AN(id) \
> +        (((id[76] != 0x) && (id[76] != 0x)) && ((id)[78] & (1 << 5)))

(id)[76] I guess ? Sorry for being a pain :/

    OG.
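
The review comment asks for the macro argument to be parenthesised everywhere it is used. A minimal standalone sketch of that form follows. It is not the patch as merged: `sketch_id_has_AN` is a hypothetical name, and the two constants for word 76 are elided in the copy of the patch quoted above, so 0x0000 and 0xffff are used here purely as an assumption (treating an all-zeros or all-ones word as "not valid").

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of an identify-data test with every use of the argument wrapped
 * in parentheses. ASSUMPTION: 0x0000 and 0xffff stand in for the constants
 * that are elided in the quoted patch.
 */
#define sketch_id_has_AN(id) \
    ((((id)[76] != 0x0000) && ((id)[76] != 0xffff)) && \
     ((id)[78] & (1 << 5)))
```

Parenthesising `id` matters as soon as the caller passes an expression rather than a plain identifier: with `id[76]`, an argument such as `buf + 0` would expand to `buf + 0[76]`, which is not the intended element access, whereas `(id)[76]` expands to `(buf + 0)[76]`.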
Re: [00/17] Large Blocksize Support V3
On Tue, 24 April 2007 15:21:05 -0700, [EMAIL PROTECTED] wrote: This patchset modifies the Linux kernel so that larger block sizes than page size can be supported. Larger block sizes are handled by using compound pages of an arbitrary order for the page cache instead of single pages with order 0. I like to see this. 2. 32/64k blocksize is also used in flash devices. Same issues. Actually most chips I encounter these days already have 128KiB. And some people seem to do some kind of raid-0 in the drivers to increase bandwidth. FS-visible blocksize is also increased by that. Unsupported - Mmapping blocks larger than page size Bummer. Can this change in the future? Issues: - There are numerous places where the kernel can no longer assume that the page cache consists of PAGE_SIZE pages that have not been fixed yet. - Defrag warning: The patch set can fragment memory very fast. It is likely that Mel Gorman's anti-frag patches and some more work by him on defragmentation may be needed if one wants to use super sized pages. If you run a 2.6.21 kernel with this patch and start a kernel compile on a 4k volume with a concurrent copy operation to a 64k volume on a system with only 1 Gig then you will go boom (ummm no ... OOM) fast. How well Mel's antifrag/defrag methods address this issue still has to be seen. only 1 Gig :) With my LogFS hat on, I don't care too much whether data is cached in terms of pages or blocks. What matters to me most is to get fed blocksize chunk on writeback and be able to read blocksize'd chunks. Compressing 64KiB at a time gives somewhere around 10% (don't remember exact number) better compression when compared to 4KiB. JFFS2 can benefit from this as well. That should also be sufficient for cross-platform compatibility, shouldn't it? Better performance for the pagecache is also nice to have, no doubt. But if system stability remains an issue, I'd rather keep slow and stable. 
Jörn -- More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity. -- W. A. Wulf
Re: [1/3] 2.6.21-rc7: known regressions (v2)
On Wed, Apr 25, 2007 at 02:29:58AM +0200, Adrian Bunk wrote: On Tue, Apr 24, 2007 at 05:14:28PM -0700, Greg KH wrote: On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote: On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote: On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote: This email lists some known regressions in Linus' tree compared to 2.6.20. If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affectected subsystem or driver, a patch of you caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues. Due to the huge amount of recipients, please trim the Cc when answering. Subject: gammu no longer works References : http://lkml.org/lkml/2007/4/20/84 Submitter : Wolfgang Erig [EMAIL PROTECTED] Status : unknown I've asked for more information about this, and so far am not sure it's a real problem. It is a real problem for me. I tried this on 2 different boxes with the same behaviour. No sync between my Nokia mobile and Linux with the latest kernel :( Sorry, I didn't see your response, have followed up on lkml now. It turned out this was actually a bug in Gammu that will be fixed in the next release of Gammu. Ah, ok, thanks for letting me know. But how was the kernel version change triggering it? greg k-h - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: AppArmor FAQ
Crispin Cowan wrote: David Wagner wrote: James Morris wrote: [...] you can change the behavior of the application and then bypass policy entirely by utilizing any mechanism other than direct filesystem access: IPC, shared memory, Unix domain sockets, local IP networking, remote networking etc. [...] Just look at their code and their own description of AppArmor. My gosh, you're right. What the heck? With all due respect to the developers of AppArmor, I can't help thinking that that's pretty lame. I think this raises substantial questions about the value of AppArmor. What is the point of having a jail if it leaves gaping holes that malicious code could use to escape? And why isn't this documented clearly, with the implications fully explained? I would like to hear the AppArmor developers defend this design decision. It was a simplicity trade off at the time, when AppArmor was mostly aimed at servers, and there was no HAL or DBUS. Now it is definitely a limitation that we are addressing. We are working on a mediation system for what kind of IPC a confined process can do http://forge.novell.com/pipermail/apparmor-dev/2007-April/000503.html Also, things like: share_mem /usr/bin/firefox r,# /bin/foo can share memory with /usr/bin/firefox for read only clearly show that you aren't using native abstractions for IPC. The native abstraction for shared memory would be the key used when creating the shared memory segment. The same goes for message queues which are noticeably missing from the simplified IPC model. This, of course, begs the question of whether you are using native abstractions for profiles at all, processes have nothing to do with the binary they started from after they've been started. The binary on disk could be something entirely different than the process from which it ran. 
Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)
Karel Zak [EMAIL PROTECTED] writes: On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote: The following extra security measures are taken for unprivileged mounts: - usermounts are limited by a sysctl tunable - force nosuid,nodev mount options on the created mount The original userspace user= solution also implies the noexec option by default (you can override the default by exec option). It means the kernel based solution is not fully compatible ;-( Why noexec? Either it was a silly or arbitrary decision, or our kernel design may be incomplete. Now I can see not wanting to support executables if you are locking down a system. The classic don't execute a program from a CD just because the CD was stuck in the drive problem. So I can see how executing code from an untrusted source could prevent exploitation of other problems, and we certainly don't want to do it automatically. Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
On Tue, Apr 24, 2007 at 04:41:58PM -0700, Linus Torvalds wrote: How many different magic ioctl's does the thing introduce? Is it really just *two* entry-points (and how simple are they, interface-wise), and nothing else? Aren't you a little late to the party here? The userland version is the one that currently is in the kernel, after all the people who said doing it in userland is not necessarily a good idea got happily ignored. Suspend2 which is the continuity of the fully-in-kernel one is the one that has been constantly rejected by Pavel, lately by saying it should be done in userspace, and hence never merged. Incidentally, it's 13 ioctls, and it's documented in Documentation/power/userland-swsusp.txt in a hard drive near you. I especially like the get the available swap space in bytes one that can only handle 32 bits. OG. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression with gammu on 2.6.21-rc7
On 4/24/07, Greg KH [EMAIL PROTECTED] wrote: Also, if you know how to use git, doing a 'git bisect' to try to track down the problem commit would be very helpful. Has to do with SIGIO, see this blog post: http://blog.cihar.com/archives/2007/04/24/kernel_2_6_21_hits_gammu/ Ray
Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.
Andi Kleen wrote: On Tuesday 24 April 2007 23:50:26 David Miller wrote: From: Ashok Raj [EMAIL PROTECTED] Date: Tue, 24 Apr 2007 14:38:35 -0700 Its not clear if we have a very generic device breakage.. most devices on these platforms are going to be more recent, (except maybe some legacy fd)... I'm not so sure, there are some modern sound cards that have a 31-bit DMA addressing limitation because they use the 31st bit as a status bit in their DMA descriptors :-) There's also a 2GB only megaraid RAID controller that's pretty popular because Dell shipped it for a long time. You can probably find almost any possible bitmask if you look long enough. Hardware vendors are notorious for this kind of optimizations. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Intel IOMMU][patch 3/8] Generic hardware support for Intel IOMMU.
On Tue, 2007-04-24 at 21:27 +0200, Andi Kleen wrote: On Tuesday 24 April 2007 08:03:02 Ashok Raj wrote: +#ifdef CONFIG_DMAR +#ifdef CONFIG_SMP +static void dmar_msi_set_affinity(unsigned int irq, cpumask_t mask) Why does it need an own interrupt type? + +config IOVA_GUARD_PAGE + bool Enables gaurd page when allocating IO Virtual Address for IOMMU + depends on DMAR + +config IOVA_NEXT_CONTIG + bool Keeps IOVA allocations consequent between allocations + depends on DMAR EXPERIMENTAL Needs reference to Intel and better description The file should have a high level description what it is good for etc. Need high level overview over what locks protects what and if there is a locking order. It doesn't seem to enable sg merging? Since you have enough space that should work. We actually have a patch to do sg merge. In my test, it doesn't have any performance gain. +static char *fault_reason_strings[] = +{ + Software, + Present bit in root entry is clear, + Present bit in context entry is clear, + Invalid context entry, + Access beyond MGAW, + PTE Write access is not set, + PTE Read access is not set, + Next page table ptr is invalid, + Root table address invalid, + Context table ptr is invalid, + non-zero reserved fields in RTP, + non-zero reserved fields in CTP, + non-zero reserved fields in PTE, + Unknown +}; + +#define MAX_FAULT_REASON_IDX (12) You got 14 of them. better use ARRAY_SIZE +#define IOMMU_NAME_LEN (7) + +struct iommu { call it intel_iommu or somesuch even when it's private. +static int __init intel_iommu_setup(char *str) +{ + if (!str) + return -EINVAL; + while (*str) { + if (!strncmp(str, off, 3)) { + dmar_disabled = 1; + printk(KERN_INFOIntel-IOMMU: disabled\n); + } + str += strcspn(str, ,); + while (*str == ',') + str++; + } + return 0; +} +__setup(intel_iommu=, intel_iommu_setup); Why can't you just use the normal iommu=off for this? iommu=off disable all iommu, intel_iommu=off just disables intel_iommu. 
Isn't possible people want to use other iommu like swiotlb? + +#define MIN_PGTABLE_PAGES (10) +static mempool_t *pgtable_mempool; +#define MIN_DOMAIN_REQ (20) +static mempool_t *domain_mempool; +#define MIN_DEVINFO_REQ(20) +static mempool_t *devinfo_mempool; Lots of mempools. How much memory does this pin? + +#define alloc_pgtable_page() mempool_alloc(pgtable_mempool, GFP_ATOMIC) +#define free_pgtable_page(vaddr) mempool_free(vaddr, pgtable_mempool) +#define alloc_domain_mem() mempool_alloc(domain_mempool, GFP_ATOMIC) +#define free_domain_mem(vaddr) mempool_free(vaddr, domain_mempool) +#define alloc_devinfo_mem() mempool_alloc(devinfo_mempool, GFP_ATOMIC) +#define free_devinfo_mem(vaddr) mempool_free(vaddr, devinfo_mempool) Do we need the macros? Better expand them in the caller. +static void __iommu_flush_cache(struct iommu *iommu, void *addr, int size) +{ + if (!ecap_coherent(iommu-ecap)) + clflush_cache_range(addr, size); +} + +#define iommu_flush_cache_entry(iommu, addr) \ + __iommu_flush_cache(iommu, addr, 8) +#define iommu_flush_cache_page(iommu, addr) \ + __iommu_flush_cache(iommu, addr, PAGE_SIZE_4K) Similar. And the 8 should be probably something more descriptive (sizeof?) +/* context entry handling */ +static struct context_entry * device_to_context_entry(struct iommu *iommu, + u8 bus, u8 devfn) +{ + struct root_entry *root; + struct context_entry *context; + unsigned long phy_addr; + unsigned long flags; + + spin_lock_irqsave(iommu-lock, flags); + root = iommu-root_entry[bus]; + if (!root_present(*root)) { + phy_addr = (unsigned long)alloc_pgtable_page(); A GFP_ATOMIC mempool is rather useless. mempool only works if it can block for someone else freeing memory and if it can't do that it's not failsafe. I'm afraid you need to revise the allocation strategy -- best would be to somehow move the memory allocations outside the spinlock paths and preallocate if possible. 
The problem is pci_map_single and friends usually called with interrupt disabled or spin locked, so we must use GFP_ATOMIC. Same problem in other code. + if (!dma_pte_present(*pte)) { + tmp = alloc_pgtable_page(); Please don't name variable tmp. I know some other code does it, but it's just bad style imho. + /* Make sure hardware complete it */ + start_time = jiffies; + while (1) { + sts = dmar_readl(iommu-reg, DMAR_GSTS_REG); + if (sts DMA_GSTS_RTPS) + break; + if (time_after(jiffies, start_time + DMAR_OPERATION_TIMEOUT)) +
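On the ARRAY_SIZE remark about fault_reason_strings earlier in this review, a small user-space sketch of the idea follows. The string table is copied from the quoted patch; ARRAY_SIZE is defined locally here, and fault_reason_to_string is a hypothetical helper, not something from the patch. The point is only that the bound follows the table instead of a hand-maintained MAX_FAULT_REASON_IDX.

```c
#include <stddef.h>

/* Local stand-in for the kernel's ARRAY_SIZE macro. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *const fault_reason_strings[] = {
    "Software",
    "Present bit in root entry is clear",
    "Present bit in context entry is clear",
    "Invalid context entry",
    "Access beyond MGAW",
    "PTE Write access is not set",
    "PTE Read access is not set",
    "Next page table ptr is invalid",
    "Root table address invalid",
    "Context table ptr is invalid",
    "non-zero reserved fields in RTP",
    "non-zero reserved fields in CTP",
    "non-zero reserved fields in PTE",
    "Unknown",
};

/*
 * Hypothetical helper: map a fault reason code to text. Any code at or
 * past the last entry falls back to the final "Unknown" string, so the
 * lookup can never index outside the table.
 */
static const char *fault_reason_to_string(unsigned int reason)
{
    size_t last = ARRAY_SIZE(fault_reason_strings) - 1;

    return fault_reason_strings[reason < last ? reason : last];
}
```

Adding or removing a string then needs no matching change to a separate constant.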
Re: [1/3] 2.6.21-rc7: known regressions (v2)
On Tue, Apr 24, 2007 at 05:51:11PM -0700, Greg KH wrote: On Wed, Apr 25, 2007 at 02:29:58AM +0200, Adrian Bunk wrote: On Tue, Apr 24, 2007 at 05:14:28PM -0700, Greg KH wrote: On Tue, Apr 24, 2007 at 11:32:53AM +0200, Wolfgang Erig wrote: On Mon, Apr 23, 2007 at 03:18:19PM -0700, Greg KH wrote: On Mon, Apr 23, 2007 at 11:48:47PM +0200, Adrian Bunk wrote: This email lists some known regressions in Linus' tree compared to 2.6.20. If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affectected subsystem or driver, a patch of you caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues. Due to the huge amount of recipients, please trim the Cc when answering. Subject: gammu no longer works References : http://lkml.org/lkml/2007/4/20/84 Submitter : Wolfgang Erig [EMAIL PROTECTED] Status : unknown I've asked for more information about this, and so far am not sure it's a real problem. It is a real problem for me. I tried this on 2 different boxes with the same behaviour. No sync between my Nokia mobile and Linux with the latest kernel :( Sorry, I didn't see your response, have followed up on lkml now. It turned out this was actually a bug in Gammu that will be fixed in the next release of Gammu. Ah, ok, thanks for letting me know. But how was the kernel version change triggering it? I don't know, perhaps a side effect of Eric's work in kernel/signal.c? The bug in Gammu was: - Gammu wrongly set FASYNC in a fcntl() call. - The unhandled SIGIO terminated Gammu in 2.6.21-rc. Gammu being terminated by the SIGIO seems to be expected and documented behavior, and the surprising thing is that it wasn't terminated with earlier kernels. greg k-h cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. 
Buck - Dragon Seed
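To illustrate the failure mode described above: SIGIO's default action is to terminate the process, so a program that turns on FASYNC/O_ASYNC should deal with SIGIO first. The following is a minimal sketch under that assumption. It is not Gammu's code or its actual fix, which is not shown in this thread; enable_async_io and on_sigio are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for O_ASYNC and F_SETOWN on glibc */
#endif
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t sigio_seen;

/* Record that SIGIO arrived instead of letting it kill the process. */
static void on_sigio(int signo)
{
    (void)signo;
    sigio_seen = 1;
}

/*
 * Hypothetical helper, not Gammu code: install a SIGIO handler first,
 * then request signal-driven I/O on fd.
 * Returns 0 on success, -1 on error.
 */
static int enable_async_io(int fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    /* Deliver SIGIO for this descriptor to our own process. */
    if (fcntl(fd, F_SETOWN, getpid()) < 0)
        return -1;

    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;

    return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}
```

A program that has no use for SIGIO can simply leave O_ASYNC unset, which avoids the problem entirely.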
[PATCH 0/2] Multiqueue network device support
This is a redesign and repost of the multiqueue network device support patches. The new API for base drivers allows multiqueue-capable devices to manage their individual queues in the network stack. The stack now handles both non-multiqueue and multiqueue devices on the same codepath. Also, allocation and deallocation of the queues is handled by the kernel instead of the driver. The PRIO qdisc is now modified to run in single-queue mode on multiqueue devices by default. A modification to tc is in another patchset being sent that allows multiqueue behavior to be turned on for PRIO. Documentation is also included describing in more detail how this works, as well as how a base driver can use the API to implement multiple queues. These patches can also be pulled from my git repository at: git-pull git://lost.foo-projects.org/~ppwaskie/git/netdev-2.6.22 mq -- Peter P. Waskiewicz Jr. [EMAIL PROTECTED]
[PATCH 1/2] Adding documentation for the new multiqueue API.
From: Peter P Waskiewicz Jr [EMAIL PROTECTED] Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] --- Documentation/networking/multiqueue.txt | 97 +++ 1 files changed, 97 insertions(+), 0 deletions(-) diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt new file mode 100644 index 000..0bc5222 --- /dev/null +++ b/Documentation/networking/multiqueue.txt @@ -0,0 +1,97 @@ + + HOWTO for multiqueue network device support + === + +Section 1: Base driver requirements for implementing multiqueue support +Section 2: Qdisc support for multiqueue devices +Section 3: Brief howto using PRIO for multiqueue devices + + +Intro: Kernel support for multiqueue devices +- + +Kernel support for multiqueue devices is only an API that is presented to the +netdevice layer for base drivers to implement. This feature is part of the +core networking stack, and all network devices will be running on the +multiqueue-aware stack. If a base driver only has one queue, then these +changes are transparent to that driver. + + +Section 2: Base driver requirements for implementing multiqueue support +--- + +Base drivers are required to use the new alloc_etherdev_mq() or +alloc_netdev_mq() functions to allocate the subqueues for the device. The +underlying kernel API will take care of the allocation and deallocation of +the subqueue memory, as well as netdev configuration of where the queues +exist in memory. + +The base driver will also need to manage the queues as it does the global +netdev-queue_lock today. Therefore base drivers should use the +netif_{start|stop|wake}_subqueue() functions to manage each queue while the +device is still operational. netdev-queue_lock is still used when the device +comes online or when it's completely shut down (unregister_netdev(), etc.). + +Finally, the base driver should indicate that it is a multiqueue device. 
The +feature flag NETIF_F_MULTI_QUEUE should be added to the netdev-features +bitmap on device initialization. Below is an example from e1000: + +#ifdef CONFIG_E1000_MQ + if ( (adapter-hw.mac.type == e1000_82571) || +(adapter-hw.mac.type == e1000_82572) || +(adapter-hw.mac.type == e1000_80003es2lan)) + netdev-features |= NETIF_F_MULTI_QUEUE; +#endif + + +Section 3: Qdisc support for multiqueue devices +--- + +Currently two qdiscs support multiqueue devices. The default qdisc, pfifo_fast, +and the PRIO qdisc. The qdisc is responsible for classifying the skb's to +bands and queues, and will store the queue mapping into skb-queue_mapping. +Use this field in the base driver to determine which queue to send the skb +to. + +pfifo_fast, being the default qdisc when a device is brought online, will not +assign a queue mapping, therefore the skb will have a value of zero. We +cannot assume anything about the device itself, how many queues it really has, +etc. Therefore sending all traffic to queue 0 is the safest thing to do here. + +The PRIO qdisc naturally plugs into a multiqueue device. Upon load of the +qdisc, PRIO will make a best-effort assignment of queue to PRIO band to evenly +distribute traffic flows. The algorithm can be found in prio_tune() in +net/sched/sch_prio.c. Once the association is made, any skb that is +classified will have skb-queue_mapping set, which will allow the driver to +properly queue skb's to multiple queues. + + +Section 4: Brief howto using PRIO for multiqueue devices + + +The userspace command 'tc,' part of the iproute2 package, is used to configure +qdiscs. To add the PRIO qdisc to your network device, assuming the device is +called eth0, run the following command: + +# tc qdisc add dev eth0 root handle 1: prio multiqueue + +This will create 3 bands, 0 being highest priority, and associate those bands +to the queues on your NIC. 
Assuming eth0 has 2 Tx queues, the band mapping +would look like: + +band 0 => queue 0 +band 1 => queue 0 +band 2 => queue 1 + +Traffic will begin flowing through each queue if your TOS values are assigning +traffic across the various bands. For example, ssh traffic will always try to +go out band 0 based on TOS -> Linux priority conversion (realtime traffic), +so it will be sent out queue 0. ICMP traffic (pings) falls into the normal +traffic classification, which is band 1. Therefore pings will be sent out +queue 1 on the NIC. + +The behavior of tc filters remains the same, where it will override TOS priority +classification. + + +Author: Peter P. Waskiewicz Jr. [EMAIL PROTECTED]
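The band listing in the howto above (3 bands on 2 Tx queues giving 0, 0, 1) can be reproduced with a simple even-spread rule. The sketch below is only one possible rule that happens to match that listing. It is not the prio_tune() code from net/sched/sch_prio.c, which is not quoted here, and map_bands_to_queues is a hypothetical name.

```c
/*
 * Hypothetical even-spread mapping, NOT the kernel's prio_tune():
 * bands are assigned to queues in contiguous runs, and when the bands
 * do not divide evenly the lower-numbered queues take one extra band.
 * band2queue must have room for `bands` entries.
 */
static void map_bands_to_queues(int bands, int queues, int *band2queue)
{
    int per_queue, extra, band = 0;

    if (bands <= 0 || queues <= 0)
        return;

    per_queue = bands / queues; /* bands every queue receives */
    extra = bands % queues;     /* first `extra` queues get one more */

    for (int q = 0; q < queues && band < bands; q++) {
        int n = per_queue + (q < extra ? 1 : 0);

        for (int i = 0; i < n; i++)
            band2queue[band++] = q;
    }
}
```

With 3 bands and 2 queues this yields band 0 and band 1 on queue 0 and band 2 on queue 1, matching the listing; with more queues than bands each band gets its own queue.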
[PATCH] IPROUTE: Modify tc for new PRIO multiqueue behavior
From: Peter P Waskiewicz Jr [EMAIL PROTECTED] Modified tc so PRIO can now have a multiqueue parameter passed to it. This will turn on multiqueue behavior if a device has more than 1 queue. Also, running tc qdisc ls dev dev will display if multiqueue is on or off. Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED] --- include/linux/pkt_sched.h |1 + tc/q_prio.c |9 ++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..bab0b9e 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -99,6 +99,7 @@ struct tc_prio_qopt { int bands; /* Number of bands */ __u8priomap[TC_PRIO_MAX+1]; /* Map: logical priority - PRIO band */ + unsigned short multiqueue; /* 0 for no mq, 1 for mq */ }; /* TBF section */ diff --git a/tc/q_prio.c b/tc/q_prio.c index d696e1b..55cb207 100644 --- a/tc/q_prio.c +++ b/tc/q_prio.c @@ -29,7 +29,7 @@ static void explain(void) { - fprintf(stderr, Usage: ... prio bands NUMBER priomap P1 P2...\n); + fprintf(stderr, Usage: ... 
prio [multiqueue] bands NUMBER priomap P1 P2...\n); } #define usage() return(-1) @@ -39,7 +39,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n int ok=0; int pmap_mode = 0; int idx = 0; - struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }}; + struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 },0}; while (argc 0) { if (strcmp(*argv, bands) == 0) { @@ -57,7 +57,9 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n return -1; } pmap_mode = 1; - } else if (strcmp(*argv, help) == 0) { + } else if (strcmp(*argv, multiqueue) == 0) + opt.multiqueue = 1; + else if (strcmp(*argv, help) == 0) { explain(); return -1; } else { @@ -105,6 +107,7 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt) if (RTA_PAYLOAD(opt) sizeof(*qopt)) return -1; qopt = RTA_DATA(opt); + fprintf(f, multiqueue %s , qopt-multiqueue ? on : off); fprintf(f, bands %u priomap , qopt-bands); for (i=0; i=TC_PRIO_MAX; i++) fprintf(f, %d, qopt-priomap[i]); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [REPORT] cfs-v4 vs sd-0.44
Could you explain for the audience the technical definition of fairness and what sorts of error metrics are commonly used? There seems to be some disagreement, and you're neutral enough of an observer that your statement would help. The definition for proportional fairness assumes that each thread has a weight, which, for example, can be specified by the user, or sth. mapped from thread priorities, nice values, etc. A scheduler achieves ideal proportional fairness if (1) it is work-conserving, i.e., it never leaves a processor idle if there are runnable threads, and (2) for any two threads, i and j, in any time interval, the ratio of their CPU time is greater than or equal to the ratio of their weights, assuming that thread i is continuously runnable in the entire interval and both threads have fixed weights throughout the interval. A corollary of this is that if both threads i and j are continuously runnable with fixed weights in the time interval, then the ratio of their CPU time should be equal to the ratio of their weights. This definition is pretty restrictive since it requires the properties to hold for any thread in any interval, which is not feasible. In practice, all algorithms try to approximate this ideal scheduler (often referred to as Generalized Processor Scheduling or GPS). Two error metrics are often used: (1) lag(t): for any interval [t1, t2], the lag of a thread at time t \in [t1, t2] is S'(t1, t) - S(t1, t), where S' is the CPU time the thread would receive in the interval [t1, t] under the ideal scheduler and S is the actual CPU time it receives under the scheduler being evaluated. (2) The second metric doesn't really have an agreed-upon name. Some call it fairness measure and some call it sth else. 
Anyway, different from lag, which is kind of an absolute measure for one thread, this metric (call it F) defines a relative measure between two threads over any time interval: F(t1, t2) = S_i(t1, t2) / w_i - S_j(t1, t2) / w_j, where S_i and S_j are the CPU time the two threads receive in the interval [t1, t2] and w_i and w_j are their weights, assuming both weights don't change throughout the interval. The goal of a proportional-share scheduling algorithm is to minimize the above metrics. If the lag function is bounded by a constant for any thread in any time interval, then the algorithm is considered to be fair. You may notice that the second metric is actually weaker than first. In fact, if an algorithm achieves a constant lag bound, it must also achieve a constant bound for the second metric, but the reverse is not necessarily true. But in some settings, people have focused on the second metric and still consider an algorithm to be fair as long as the second metric is bounded by a constant. On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote: I understand that via experiments we can show a design is reasonably fair in the common case, but IMHO, to claim that a design is fair, there needs to be some kind of formal analysis on the fairness bound, and this bound should be proven to be constant. Even if the bound is not constant, at least this analysis can help us better understand and predict the degree of fairness that users would experience (e.g., would the system be less fair if the number of threads increases? What happens if a large number of threads dynamically join and leave the system?). Carrying out this sort of analysis on various policies would help, but I'd expect most of them to be difficult to analyze. cfs' current -fair_key computation should be simple enough to analyze, at least ignoring nice numbers, though I've done nothing rigorous in this area. If we can derive some invariants from the algorithm, it'd help the analysis. 
An example is the deficit round-robin (DRR) algorithm in networking. Its analysis relies on the fact that the number of rounds any two flows (here, threads) go through in any time interval differs by at most one. Hope you didn't get bored by all of this. :) tong
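As a worked illustration of the two error metrics defined above, here is a small sketch in C. It assumes a single CPU, fixed weights, and that every thread in the weight array is continuously runnable for the whole interval, so the ideal (GPS) share of thread i is simply interval * w_i / sum(w). The function names are made up for this example.

```c
/*
 * CPU time thread i would receive under the ideal scheduler in an
 * interval of length `interval`.
 * Assumptions: one CPU, fixed weights, all n threads continuously
 * runnable for the whole interval.
 */
static double ideal_share(const double *w, int n, int i, double interval)
{
    double total = 0.0;

    for (int k = 0; k < n; k++)
        total += w[k];
    return interval * w[i] / total;
}

/* lag = S'(t1, t) - S(t1, t): ideal CPU time minus actual CPU time. */
static double lag(const double *w, int n, int i, double interval,
                  double actual)
{
    return ideal_share(w, n, i, interval) - actual;
}

/* F(t1, t2) = S_i / w_i - S_j / w_j for two threads i and j. */
static double fairness_measure(double s_i, double w_i,
                               double s_j, double w_j)
{
    return s_i / w_i - s_j / w_j;
}
```

For example, with weights 1 and 2 over 12 time units the ideal shares are 4 and 8. If the scheduler actually gave each thread 6 units, the lags are -2 and +2 and the second metric is 6/1 - 6/2 = 3.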
[PATCH 2/2] NET: [UPDATED] Multiqueue network device support implementation.
From: Peter P Waskiewicz Jr [EMAIL PROTECTED]

Update: Fixed band2queue mapping logic - it was reversed with prio2band. Added support in the PRIO qdisc to allow tc to turn on multiqueue behavior, while keeping original PRIO behavior by default. Fixed where skb->queue_mapping is being reset (prior to q->enqueue()).

Added an API and associated supporting routines for multiqueue network devices. This allows network devices supporting multiple TX queues to configure each queue within the netdevice and manage each queue independently. Changes to the PRIO Qdisc also allow a user to map multiple flows to individual TX queues, taking advantage of each queue on the device.

Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---
 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   66 ++-
 include/linux/pkt_sched.h   |    1 +
 include/linux/skbuff.h      |    2 +
 net/core/dev.c              |   28 +++---
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++---
 net/sched/sch_generic.c     |    4 +--
 net/sched/sch_prio.c        |   66 +++
 9 files changed, 162 insertions(+), 20 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 745c988..446de39 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh, struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 static inline void eth_copy_and_sum (struct sk_buff *dest,
 				     const unsigned char *src,
 				     int len, int base)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 584c199..6829880 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue. This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -326,6 +334,7 @@ struct net_device
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
 #define NETIF_F_INTERNAL_STATS	8192	/* Use stats structure in net_device */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -538,6 +547,14 @@ struct net_device
 	struct device		dev;
 	/* space for optional statistics and wireless sysfs groups */
 	struct attribute_group	*sysfs_groups[3];
+
+	/* To retrieve statistics per subqueue - FOR FUTURE USE */
+	struct net_device_stats* (*get_subqueue_stats)(struct net_device *dev,
+						       int queue_index);
+
+	/* The TX queue control structures */
+	struct net_device_subqueue	*egress_subqueue;
+	int				egress_subqueue_count;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -679,6 +696,48 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device. We only need start
+ * stop, and a check if it's stopped. All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+					 u16 queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
Re: Kernel traces coming back with trash/clutter
I am getting this odd content in the trace log (dmesg), and I cannot figure out what it is or why it is there.

7 7 7 7 7__bio_add_page: 2x ph 88=128 || hw 88=88 || 360448max 802525d8
generic_make_request(bio 01017c745300) 50729472, 704

Perhaps you have something looping that's outputting KERN_DEBUG with a null message? Or one of your diagnostic printk statements includes KERN_DEBUG with no actual message?

No, they are all KERN_DEBUG<space>"some string here", almost all with some formatted output as well. Could I be overloading the printk output buffer, as in possibly too tightly repeated/looped code to be able to output it all?

It is possible, I suppose. Is what you're working on open-source? If so, you could send it to me and I could try to reproduce it here and track it down. If you want me to, that is. (If you do send, please include a .config.)

Otherwise, I couldn't tell you what it might be. Make sure all your messages end with '\n', and make sure you're not accidentally using the wrong formatting codes so that the output backs over previous output with ^H or something.

You could confirm or rule out the possibility of overflowing the printk buffers by writing a dummy module with a tight loop of nothing but printk statements with counters to see if you can get it to asplode.
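The counter idea above can be tried out in user space before writing a module. What follows is only a sketch under stated assumptions: it models the log as a fixed number of message slots that overwrite the oldest entry (the real printk buffer is a byte ring, not a slot ring), and the names `log_ring`, `ring_put` and `count_lost` are hypothetical, not kernel APIs. Each message carries a counter, and a gap in the counters read back means messages were dropped.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8  /* assumed capacity; the real buffer is sized in bytes */

/* Hypothetical model of a log buffer that overwrites its oldest entries. */
struct log_ring {
    unsigned long seq[RING_SLOTS]; /* counter stored with each message */
    size_t head;                   /* total number of messages ever written */
};

/* Writer side: stands in for one printk call that prints a counter. */
static void ring_put(struct log_ring *r, unsigned long counter)
{
    r->seq[r->head % RING_SLOTS] = counter;
    r->head++;
}

/*
 * Reader side: walk the surviving messages oldest-first and add up the
 * counters that are missing, given that the writer started at first_expected.
 */
static unsigned long count_lost(const struct log_ring *r,
                                unsigned long first_expected)
{
    size_t live = r->head < RING_SLOTS ? r->head : RING_SLOTS;
    size_t start = r->head - live;
    unsigned long expected = first_expected;
    unsigned long lost = 0;

    for (size_t i = 0; i < live; i++) {
        unsigned long got = r->seq[(start + i) % RING_SLOTS];

        assert(got >= expected);  /* counters only ever increase */
        lost += got - expected;   /* size of the gap before this message */
        expected = got + 1;
    }
    return lost;
}
```

If a real test module shows gaps in its counters, the buffer is being overrun; if the counters are contiguous and the stray characters still appear, the cause is more likely in the message formatting.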
Reasons to merge suspend2.
Hi all. I've been working on this email on and off for a while, but since Pavel raised the issue again, I thought I should make a concerted effort to finish it... In this email, I'm going to outline the problems with the current design (uswsusp and swsusp) and the ways in which Suspend2 overcomes those limitations, before going on to outline the additional advantages Suspend2 has for users and address objections previously raised against merging Suspend2. A) Problems with the current design. 1) Ordering of operations. The current [u]swsusp design doesn't do things in discrete, well ordered stages. Storage for the image is not allocated until after the atomic copy has been done. This means that the process can fail when we are a significant portion of the way into suspending, and it means it can fail when the user will seriously expect it to run to completion. The solution to this issue is simple: separate preparing to suspend from actually writing the image. In the preparation step, ensure, so far as you are able, that there will be sufficient memory and sufficient storage to complete the process, and don't write anything or do any atomic copying until after that has been done. The only valid objection I can think of is that you can't know for certain prior to doing the atomic copy how much memory storage will be needed for allocations by driver suspend methods. That can be addressed by a simple extension of the driver model, where in drivers could report how many pages they will need. (If slab will be needed, the worst case can be assumed). Rafael's notify patches (recently posted) also help in that area. Once processes are frozen, all significant memory usage can be accounted for, because the process doing the suspending will be the only one allocating memory. 2) Limit on image size. The current implementation limits the size of an image to an absolute maximum of half the amount of ram. 
This is certainly an improvement over the old days where it sought to free everything it could, but it's still not good enough. Current memory freeing code doesn't free the exact amount requested; often far more than has been requested is freed. This does not only result in a smaller image. It also means the system is proportionately less responsive on resume at whatever stage that those pages are needed again. A full image is certainly not needed by everyone. Those with huge amounts of memory, very fast storage devices or particular memory usage patterns may, quite rightly, not want to store the whole lot in an image. This doesn't mean, however, that those who want or need (from their perspective) a full image of memory shouldn't be able to have it. It just adds to the argument for making it tunable (which swsusp has done too). 3) Lack of provision for tuning to individual needs. Swsusp historically included very little provision whatsoever for the user to tune their configuration. This has recently begun to change, and I applaud that. But it needs to go further. Suspending to disk is not a one-size-fits-all situation. People have different hardware configurations, with the result being that some people benefit from compression while others do better without it. Some people want encryption in a particular configuration while others don't care about encryption at all. Some people want to limit the image size, others don't. Sometimes a user might want to reboot instead of powering down (dual booting). All of this should be doable, without having to hack the code or recompile the kernel, and should be as simple as possible. Suspend2, via its /sys/power/suspend2 interface and hibernate-script porcelain, makes this easy. 4) No support for multiple swap devices / non swap storage. Until recently, [u]swsusp supported a single swap partition only. Support for a swap file has been added, but [u]swsusp still supports only one swap device at a time. 
For most people, this is adequate, but this doesn't mean everyone should be forced to fit this mould. [u]swsusp also lacks support for storage to non-swap. Particularly in systems that rely on swap for normal activity, this can make [u]swsusp less reliable. The amount of swap available varies according to workload, so sometimes the user will be unable to suspend. To address this raciness/competition against other swap usage, Suspend2 supports writing to a generic file, either a partition or a file on an ordinary partition. B) Further advantages of Suspend2. == 1) Improvements over swsusp. a) Modular design. Parts of Suspend2 implement support for storing an image in swap or in a file, using cryptoapi for compression and/or encryption and talking to a userspace user interface via a netlink socket. Suspend2 works just fine without CONFIG_SWAP, CONFIG_NET and/or CONFIG_CRYPTOAPI, however, because it uses a modular design wherein support for these subsystems is abstracted
RE: [PATCH] x86_64/acpi: make kernel to be compiled when CONFIG_ACPI_NUMA is set and power management with acpi is not enabled
-Original Message- From: Len Brown [mailto:[EMAIL PROTECTED] mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 10, 2007 1:33 AM Let me know if you have one that doesn't. Please check this one. it will not compiled. grep ACPI .config CONFIG_X86_64_ACPI_NUMA=y CONFIG_ACPI=y CONFIG_ACPI_NUMA=y # CONFIG_PNPACPI is not set # CONFIG_BLK_DEV_IDEACPI is not set CONFIG_SATA_ACPI=y YH # # Automatically generated make config: don't edit # Linux kernel version: 2.6.21-rc7 # Tue Apr 24 18:27:37 2007 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION=-smp CONFIG_LOCALVERSION_AUTO=y # CONFIG_SWAP is not set CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_SYSVIPC_SYSCTL=y # CONFIG_POSIX_MQUEUE is not set CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_CPUSETS is not set CONFIG_SYSFS_DEPRECATED=y # CONFIG_RELAY is not set CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE= CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y 
CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set # CONFIG_KMOD is not set CONFIG_STOP_MACHINE=y # # Block layer # CONFIG_BLOCK=y # CONFIG_BLK_DEV_IO_TRACE is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_AS=y # CONFIG_DEFAULT_DEADLINE is not set # CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED=anticipatory # # Processor type and features # CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set CONFIG_MK8=y # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set # CONFIG_GENERIC_CPU is not set CONFIG_X86_L1_CACHE_BYTES=64 CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_INTERNODE_CACHE_BYTES=64 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y # CONFIG_MICROCODE is not set CONFIG_X86_MSR=y CONFIG_X86_CPUID=y CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y CONFIG_SMP=y # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set # CONFIG_PREEMPT_BKL is not set CONFIG_NUMA=y CONFIG_K8_NUMA=y CONFIG_NODES_SHIFT=6 CONFIG_X86_64_ACPI_NUMA=y # CONFIG_NUMA_EMU is not set CONFIG_ARCH_DISCONTIGMEM_ENABLE=y CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y # CONFIG_FLATMEM_MANUAL is not set CONFIG_DISCONTIGMEM_MANUAL=y # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_DISCONTIGMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_NEED_MULTIPLE_NODES=y # CONFIG_SPARSEMEM_STATIC is not set # CONFIG_MEMORY_HOTPLUG is not set CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_MIGRATION=y CONFIG_RESOURCES_64BIT=y 
CONFIG_ZONE_DMA_FLAG=1 CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y CONFIG_NR_CPUS=255 # CONFIG_HOTPLUG_CPU is not set CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_HPET_TIMER=y # CONFIG_HPET_EMULATE_RTC is not set CONFIG_IOMMU=y # CONFIG_CALGARY_IOMMU is not set CONFIG_SWIOTLB=y CONFIG_X86_MCE=y # CONFIG_X86_MCE_INTEL is not set CONFIG_X86_MCE_AMD=y CONFIG_KEXEC=y # CONFIG_CRASH_DUMP is not set CONFIG_PHYSICAL_START=0x20 CONFIG_SECCOMP=y # CONFIG_CC_STACKPROTECTOR is not set # CONFIG_HZ_100 is not set CONFIG_HZ_250=y # CONFIG_HZ_300 is not set # CONFIG_HZ_1000 is not set CONFIG_HZ=250 # CONFIG_REORDER is not set CONFIG_K8_NB=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_ISA_DMA_API=y CONFIG_GENERIC_PENDING_IRQ=y # # Power management options # # CONFIG_PM is not set CONFIG_ACPI=y CONFIG_ACPI_NUMA=y # # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=y # CONFIG_CPU_FREQ_DEBUG is not set CONFIG_CPU_FREQ_STAT=y #
Re: [PATCH 2/2] Align ZONE_MOVABLE to a MAX_ORDER_NR_PAGES boundary
Looks good. :-) Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]

The boot memory allocator makes assumptions on the alignment of zone boundaries even though the buddy allocator has no requirements on the alignment of zones. This may cause boot problems in situations where ZONE_MOVABLE is populated because the bootmem allocator assumes zones are at least order-log2(BITS_PER_LONG) aligned. As the two potential users (huge pages and memory hot-remove) of ZONE_MOVABLE would prefer a higher alignment, this patch aligns the start of the zone instead of fixing the different assumptions made by the bootmem allocator.

This patch rounds the start of ZONE_MOVABLE in each node to a MAX_ORDER_NR_PAGES boundary. If the rounding pushes the start of ZONE_MOVABLE above the end of the node then the zone will contain no memory and will not be used at runtime. The value is rounded up instead of down as it is better to have the kernel-portion of memory larger than requested instead of smaller. The impact is that the kernel-usable portion of memory becomes a minimum guarantee instead of the exact size requested by the user.

Signed-off-by: Mel Gorman [EMAIL PROTECTED]
Acked-by: Andy Whitcroft [EMAIL PROTECTED]
---
 page_alloc.c |    5 +
 1 files changed, 5 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c
--- linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c	2007-04-24 09:38:30.0 +0100
+++ linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c	2007-04-24 11:15:40.0 +0100
@@ -3642,6 +3642,11 @@ restart:
 	usable_nodes--;
 	if (usable_nodes && required_kernelcore > usable_nodes)
 		goto restart;
+
+	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
+	for (nid = 0; nid < MAX_NUMNODES; nid++)
+		zone_movable_pfn[nid] =
+			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 }
 
 /**

-- 
Yasunori Goto
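The effect of the rounding described in this patch can be checked with a few numbers in user space. This is a minimal sketch, not the kernel code: it assumes MAX_ORDER is 11 (so MAX_ORDER_NR_PAGES is 1024), which is configuration dependent, and `roundup_ul` and `movable_pages` are hypothetical helper names.

```c
#include <assert.h>

/* Assumed values: MAX_ORDER is configurable; 11 is used here for illustration. */
#define MAX_ORDER 11
#define MAX_ORDER_NR_PAGES (1UL << (MAX_ORDER - 1))

/* Round x up to the next multiple of y, in the spirit of the kernel's roundup(). */
static unsigned long roundup_ul(unsigned long x, unsigned long y)
{
    assert(y != 0);
    return ((x + y - 1) / y) * y;
}

/*
 * Pages left in ZONE_MOVABLE for a node that ends at node_end_pfn (exclusive)
 * once the requested start has been aligned. Zero means the zone is empty.
 */
static unsigned long movable_pages(unsigned long requested_start_pfn,
                                   unsigned long node_end_pfn)
{
    unsigned long start = roundup_ul(requested_start_pfn, MAX_ORDER_NR_PAGES);

    return start < node_end_pfn ? node_end_pfn - start : 0;
}
```

With these assumed values, a requested start of pfn 3000 in a node ending at pfn 8192 is moved up to 3072, so the kernel portion grows by 72 pages. A requested start of 8000 is moved to 8192, which is the end of the node, so the zone ends up empty, matching the behaviour the description calls out.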
Re: [PATCH 06/25] xen: Core Xen implementation
Andi Kleen wrote: On Monday 23 April 2007 23:56:44 Jeremy Fitzhardinge wrote: Core Xen Implementation This patch is a rollup of all the core pieces of the Xen implementation, including booting, memory management, interrupts, time and so on. The patch is definitely too big. Yes. It was originally smaller patches, which I tried to keep in a state where everything was incrementally buildable, but it got too hard to keep it all together. I guess I can break it down into functional groups, and put the config stuff at the end. +#ifdef CONFIG_XEN +/* Xen only supports sysenter/sysexit in ring0 guests, + and only if it the guest asks for it. So for now, + this should never be used. */ +ENTRY(xen_sti_sysexit) +CFI_STARTPROC +ud2 +CFI_ENDPROC +ENDPROC(xen_sti_sysexit) Put that elsewhere? It doesn't need to be here. Yes, I can drop it. It's not needed in this kernel. +++ b/arch/i386/xen/enlighten.c @@ -0,0 +1,727 @@ Comments describing what all the files do? OK. +unsigned maskedx = ~0; +if (*eax == 1) +maskedx = ~((1 X86_FEATURE_APIC) | +(1 X86_FEATURE_ACPI) | +(1 X86_FEATURE_ACC)); Why ACC? And why doesn't Xen mask those by itself? Because it doesn't care whether they're set or not. I'm suppressing them here to prevent the kernel from trying to use these features. I suppress ACC in particular to stop the P4 thermal interrupt code from trying to do anything. I'll comment it. And you got apic functions later which would be never called? Why are the hooks needed then? They aren't. They're only for VMI. I've only got them to make sure that there are no stray APIC usages. + +static unsigned long xen_save_fl(void) +{ +struct vcpu_info *vcpu; +unsigned long flags; + +preempt_disable(); +vcpu = x86_read_percpu(xen_vcpu); +/* flag has opposite sense of mask */ +flags = !vcpu-evtchn_upcall_mask; +preempt_enable(); If you use get_cpu/put_cpu it will be optimized away on PREEMPT !SMP (more occurrences) Won't preempt_disable disappear as well? I don't need the CPU number. 
+static void xen_restore_fl(unsigned long flags) +{ +struct vcpu_info *vcpu; + +preempt_disable(); + +/* convert from IF type flag */ +flags = !(flags X86_EFLAGS_IF); +vcpu = x86_read_percpu(xen_vcpu); +vcpu-evtchn_upcall_mask = flags; +if (flags == 0) { +barrier(); /* unmask then check (avoid races) */ Don't you need a rmb() here then? The CPU could speculate reads (more occurrences) OK. +if (unlikely(vcpu-evtchn_upcall_pending)) +force_evtchn_callback(); +preempt_enable(); +} else +preempt_enable_no_resched(); +} + +static void xen_irq_disable(void) +{ +struct vcpu_info *vcpu; +preempt_disable(); +vcpu = x86_read_percpu(xen_vcpu); +vcpu-evtchn_upcall_mask = 1; +preempt_enable_no_resched(); First with the new per cpu the preempt disable shouldn't be needed anymore because the thing is atomic. In the worst case you do the change on the previous CPU, but that can happen anyways after preempt_enable No, there's a one instruction preempt window there. If I do: mov %fs:xen_vcpu, %eax movb $1,1(%eax) and a preempt happens in between, then the interrupt will be disabled on the wrong cpu. Once we can put the vcpu structure into the percpu area directly, then I can do: movb $1,%fs:xen_vcpu+1 which is preempt-safe, of course. And then when you have enabled who transfers the irq off state to the new CPU? I don't follow you. +static void xen_halt(void) +{ +#if 0 +if (irqs_disabled()) +HYPERVISOR_vcpu_op(VCPUOP_down, smp_processor_id(), NULL); +#endif +} Who halts then? I fix this up in the xen-machine-ops.patch. +static void xen_load_gdt(const struct Xgt_desc_struct *dtr) +{ +unsigned long *frames; +unsigned long va = dtr-address; +unsigned int size = dtr-size + 1; +int f; +struct multicall_space mcs; + +BUG_ON(size 16*PAGE_SIZE); Why 16? I'll make it more explicit. It's 64k of GDT entries == 16 pages. +count = desc-size / 8; +BUG_ON(count 256); should be = ? I think 256 idt entries is OK, but it should be (desc-size+1) / 8. 
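The two size checks discussed above come down to the same arithmetic: the descriptor-table size field holds the table length in bytes minus one, and each descriptor is 8 bytes. A small user-space sketch of that arithmetic follows; the helper names `desc_entries` and `desc_pages` are hypothetical and 4 KiB pages are assumed.

```c
#include <assert.h>

#define DESC_BYTES 8u     /* one i386 GDT/IDT descriptor is 8 bytes */
#define PAGE_BYTES 4096u  /* assumed 4 KiB pages */

/* The limit field holds (size in bytes - 1), so add 1 before dividing. */
static unsigned int desc_entries(unsigned int limit)
{
    assert(limit < 0x10000u);  /* the limit field is 16 bits wide */
    return (limit + 1) / DESC_BYTES;
}

/* Pages needed to hold a table with the given limit, rounded up. */
static unsigned int desc_pages(unsigned int limit)
{
    assert(limit < 0x10000u);
    return (limit + 1 + PAGE_BYTES - 1) / PAGE_BYTES;
}
```

For a full IDT the limit is 2047, which gives 256 entries with the +1 and only 255 without it. A maximal GDT has limit 65535, which is 8192 descriptors spread over 16 pages.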
+static void xen_set_iopl_mask(unsigned mask) +{ +#if 0 +struct physdev_set_iopl set_iopl; + +/* Force the change at ring 0. */ +set_iopl.iopl = (mask == 0) ? 1 : (mask 12) 3; +HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, set_iopl); +#endif And who does iopl then? Nobody at the moment. I don't think there's much need for it in an unprivileged Xen domU. I could just nop it out for now. + * Page-directory addresses above 4GB do not fit into architectural
Re: [00/17] Large Blocksize Support V3
On Tue, Apr 24, 2007 at 03:21:05PM -0700, [EMAIL PROTECTED] wrote:

V2->V3
- More restructuring
- It actually works!
- Add XFS support
- Fix up UP support
- Work out the direct I/O issues
- Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert back to constants. Disabled for 32bit and HIGHMEM configurations. This also allows a gradual migration to the new page cache inline functions. LARGE_BLOCKSIZE capabilities can be added gradually and if there is a problem then we can disable a subsystem.

Excellent, I'll do some testing here at the very least.

-- wli
[PATCH] drivers/ata: remove the wildcard from sata_nv driver
Newer NVIDIA SATA controllers are based on AHCI, so the wildcard entries in the sata_nv driver are unnecessary. The wildcard also sometimes causes the sata_nv driver to be loaded for AHCI controllers, which is not expected.

Signed-off-by: Peer Chen [EMAIL PROTECTED]
=
--- linux-2.6.21-rc7/drivers/ata/sata_nv.c.orig
+++ linux-2.6.21-rc7/drivers/ata/sata_nv.c
@@ -285,12 +285,6 @@ static const struct pci_device_id nv_pci
 	{ PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA), GENERIC },
 	{ PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA2), GENERIC },
 	{ PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA3), GENERIC },
-	{ PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-		PCI_ANY_ID, PCI_ANY_ID,
-		PCI_CLASS_STORAGE_IDE<<8, 0xffff00, GENERIC },
-	{ PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-		PCI_ANY_ID, PCI_ANY_ID,
-		PCI_CLASS_STORAGE_RAID<<8, 0xffff00, GENERIC },
 	{ } /* terminate list */
 };
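Why a class-based wildcard entry grabs controllers the driver was never written for can be seen in a simplified user-space model of PCI ID matching. This is a sketch under assumptions: subsystem IDs are left out, the class mask 0xffff00 (match base class and subclass, ignore the programming interface) is assumed, and the device IDs 0x1111 and 0x2222 are made up for illustration. The structure and function names are hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_ANY_ID             (~0u)
#define PCI_VENDOR_ID_NVIDIA   0x10deu
#define PCI_CLASS_STORAGE_IDE  0x0101u
#define PCI_CLASS_STORAGE_RAID 0x0104u
#define PCI_CLASS_STORAGE_SATA 0x0106u

/* Simplified ID-table entry: subvendor/subdevice are omitted. */
struct id_entry {
    uint32_t vendor, device;
    uint32_t class_code, class_mask;
};

/* Simplified view of a device as seen on the bus. */
struct pci_dev_model {
    uint32_t vendor, device;
    uint32_t class_code;  /* 24 bits: base class, subclass, prog-if */
};

/* An entry matches when every non-wildcard field agrees with the device. */
static int id_matches(const struct id_entry *id, const struct pci_dev_model *dev)
{
    assert(id && dev);
    return (id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
           (id->device == PCI_ANY_ID || id->device == dev->device) &&
           !((id->class_code ^ dev->class_code) & id->class_mask);
}
```

In this model an entry with an explicit device ID only matches that device, while the vendor-plus-class wildcard matches any NVIDIA controller that reports the IDE (or RAID) storage class, including a newer one whose device ID the driver has never seen. A controller reporting the SATA class is not matched by either kind of entry.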
[RFC][PATCH] sysctl for selecting global zonelist[] order
Make zonelist policy selectable from sysctl.

Assume a 2-node NUMA system where only node(0) has ZONE_DMA (ZONE_DMA32). In this case, the default (node0's) zonelist order is

Node(0)'s NORMAL -> Node(0)'s DMA -> Node(1)'s NORMAL.

This means Node(0)'s DMA is used before Node(1)'s NORMAL.

On some servers, some applications make large memory allocations. This exhausts memory in the above order. Then OOM_KILL will sometimes occur when a 32bit device requires memory.

This patch adds a sysctl for rebuilding the zonelist after boot and doesn't change the default zonelist order.

command:
%echo 0 > /proc/sys/vm/better_locality

Will rebuild the zonelist in the following order:
Node(0)'s NORMAL -> Node(1)'s NORMAL -> Node(0)'s DMA.

If better_locality == 1 (default), the zonelist is
Node(0)'s NORMAL -> Node(0)'s DMA -> Node(1)'s NORMAL.

May be useful for some users with heavy memory pressure and mlocks.

Tested under ia64 2 node NUMA against 2.6.21-rc7. Works well.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]

Index: linux-2.6.21-rc7/kernel/sysctl.c
===
--- linux-2.6.21-rc7.orig/kernel/sysctl.c
+++ linux-2.6.21-rc7/kernel/sysctl.c
@@ -76,6 +76,9 @@ extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
 extern int percpu_pagelist_fraction;
 extern int compat_log;
+#ifdef CONFIG_NUMA
+extern int sysctl_better_locality;
+#endif
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -845,6 +848,15 @@ static ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.ctl_name	= VM_BETTER_LOCALITY,
+		.procname	= "better_locality",
+		.data		= &sysctl_better_locality,
+		.maxlen		= sizeof(sysctl_better_locality),
+		.mode		= 0644,
+		.proc_handler	= &sysctl_better_locality_handler,
+		.strategy	= &sysctl_intvec,
+	},
 #endif
 #if defined(CONFIG_X86_32) || \
    (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
Index: linux-2.6.21-rc7/mm/page_alloc.c
===
--- linux-2.6.21-rc7.orig/mm/page_alloc.c
+++ linux-2.6.21-rc7/mm/page_alloc.c
@@ -1670,7 +1670,7 @@ static int __meminit build_zonelists_nod
 
 #ifdef CONFIG_NUMA
 #define MAX_NODE_LOAD (num_online_nodes())
-static int __meminitdata node_load[MAX_NUMNODES];
+static int node_load[MAX_NUMNODES];
 /**
  * find_next_best_node - find the next node that should appear in a given node's fallback list
  * @node: node whose fallback list we're appending
@@ -1685,7 +1685,7 @@ static int __meminitdata node_load[MAX_N
  * on them otherwise.
  * It returns -1 if no node is found.
  */
-static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -1731,7 +1731,10 @@ static int __meminit find_next_best_node
 	return best_node;
 }
 
-static void __meminit build_zonelists(pg_data_t *pgdat)
+/*
+ * Build zonelists based on node locality.
+ */
+static void build_zonelists_locality_aware(pg_data_t *pgdat)
 {
 	int j, node, local_node;
 	enum zone_type i;
@@ -1780,6 +1783,78 @@ static void __meminit build_zonelists(pg
 	}
 }
 
+/*
+ * Build zonelist based on zone priority.
+ */
+static int node_order[MAX_NUMNODES];
+static void build_zonelists_zone_aware(pg_data_t *pgdat)
+{
+	int i, j, pos, zone_type, node, load;
+	nodemask_t used_mask;
+	int local_node, prev_node;
+	struct zone *z;
+	struct zonelist *zonelist;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		zonelist->zones[0] = NULL;
+	}
+	memset(node_order, 0, sizeof(node_order));
+	local_node = pgdat->node_id;
+	load = num_online_nodes();
+	prev_node = local_node;
+	nodes_clear(used_mask);
+	j = 0;
+	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+		int distance = node_distance(local_node, node);
+		if (distance > RECLAIM_DISTANCE)
+			zone_reclaim_mode = 1;
+		if (distance != node_distance(local_node, prev_node))
+			node_load[node] = load;
+		node_order[j++] = node;
+		prev_node = node;
+		load--;
+	}
+	/* calculate node order */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		pos = 0;
+		for (zone_type = i; zone_type >= 0; zone_type--) {
+			for (j = 0; j < num_online_nodes(); j++) {
+				node = node_order[j];
+				z = &NODE_DATA(node)->node_zones[zone_type];
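The difference between the two orderings in this patch is only which loop is on the outside. The following user-space sketch reproduces the 2-node example from the description; it is not the kernel code, and the types and function names (`zone_ref`, `build_node_order`, `build_zone_order`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum zone_kind { ZK_DMA = 0, ZK_NORMAL = 1, ZK_COUNT };

struct zone_ref {
    int node;
    enum zone_kind kind;
};

/*
 * Locality-aware order: for each node (nearest first), list all of its
 * populated zones from highest to lowest before moving to the next node.
 * populated[n][k] is nonzero when node n has memory in zone k.
 */
static size_t build_node_order(const int populated[][ZK_COUNT],
                               const int *node_order, size_t nr_nodes,
                               struct zone_ref *out)
{
    size_t pos = 0;

    for (size_t j = 0; j < nr_nodes; j++) {
        int node = node_order[j];

        for (int k = ZK_COUNT - 1; k >= 0; k--)
            if (populated[node][k])
                out[pos++] = (struct zone_ref){ node, (enum zone_kind)k };
    }
    return pos;
}

/*
 * Zone-aware order: for each zone type (highest first), list that zone on
 * every node (nearest first) before falling back to a lower zone type.
 */
static size_t build_zone_order(const int populated[][ZK_COUNT],
                               const int *node_order, size_t nr_nodes,
                               struct zone_ref *out)
{
    size_t pos = 0;

    for (int k = ZK_COUNT - 1; k >= 0; k--)
        for (size_t j = 0; j < nr_nodes; j++) {
            int node = node_order[j];

            if (populated[node][k])
                out[pos++] = (struct zone_ref){ node, (enum zone_kind)k };
        }
    return pos;
}
```

With node 0 holding DMA and NORMAL and node 1 holding only NORMAL, the first function yields N0 NORMAL, N0 DMA, N1 NORMAL (the default order in the description) and the second yields N0 NORMAL, N1 NORMAL, N0 DMA, which keeps the DMA zone as the last resort.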
Re: regression with gammu on 2.6.21-rc7
On Tue, Apr 24, 2007 at 06:12:33PM -0700, Ray Lee wrote:
On 4/24/07, Greg KH [EMAIL PROTECTED] wrote:

Also, if you know how to use git, doing a 'git bisect' to try to track down the problem commit would be very helpful.

Has to do with SIGIO, see this blog post: http://blog.cihar.com/archives/2007/04/24/kernel_2_6_21_hits_gammu/

Ah, thank you very much, that makes more sense as nothing changed in that usb-serial driver and I was starting to get a bit worried...

thanks,

greg k-h
Ramdisk Vs NFS
Hi,

What is the primary difference between Ramdisk and NFS with respect to the wait queues? If I use a ramdisk, everything works fine, but with NFS (or you may read that as 'no ramdisk') the kernel/sched.c:__wake_up_common() routine has a problem. Basically, the value of q->task_list->next is out of our memory range (not between 0xc000 and 0xF000), and this causes trouble by accessing non-existent memory.

Why would this happen? The interesting thing is that this happens long before we even load the ramdisk drivers.

I would appreciate it if anyone has some insight into this. At least a pointer to where to start looking would be great.

Thanks
Siva
Re: [PATCH] IPROUTE: Modify tc for new PRIO multiqueue behavior
Peter P Waskiewicz Jr wrote:

From: Peter P Waskiewicz Jr [EMAIL PROTECTED]

Modified tc so PRIO can now have a multiqueue parameter passed to it. This will turn on multiqueue behavior if a device has more than 1 queue. Also, running tc qdisc ls dev <dev> will display if multiqueue is on or off.

Signed-off-by: Peter P. Waskiewicz Jr [EMAIL PROTECTED]
---
 include/linux/pkt_sched.h |    1 +
 tc/q_prio.c               |    9 ++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..bab0b9e 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -99,6 +99,7 @@ struct tc_prio_qopt
 {
 	int	bands;			/* Number of bands */
 	__u8	priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
+	unsigned short multiqueue;	/* 0 for no mq, 1 for mq */
 };
 
 /* TBF section */
diff --git a/tc/q_prio.c b/tc/q_prio.c
index d696e1b..55cb207 100644
--- a/tc/q_prio.c
+++ b/tc/q_prio.c
@@ -29,7 +29,7 @@
 
 static void explain(void)
 {
-	fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...\n");
+	fprintf(stderr, "Usage: ... prio [multiqueue] bands NUMBER priomap P1 P2...\n");
 }
 
 #define usage() return(-1)
@@ -39,7 +39,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
 	int ok=0;
 	int pmap_mode = 0;
 	int idx = 0;
-	struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
+	struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 },0};
 
 	while (argc > 0) {
 		if (strcmp(*argv, "bands") == 0) {
@@ -57,7 +57,9 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
 				return -1;
 			}
 			pmap_mode = 1;
-		} else if (strcmp(*argv, "help") == 0) {
+		} else if (strcmp(*argv, "multiqueue") == 0)
+			opt.multiqueue = 1;
+		else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
 		} else {
@@ -105,6 +107,7 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	if (RTA_PAYLOAD(opt) < sizeof(*qopt))
 		return -1;
 	qopt = RTA_DATA(opt);
+	fprintf(f, "multiqueue %s ", qopt->multiqueue ? "on" : "off");
 	fprintf(f, "bands %u priomap ", qopt->bands);
 	for (i=0; i<=TC_PRIO_MAX; i++)
 		fprintf(f, " %d", qopt->priomap[i]);

Only if this is binary compatible with older kernels.
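The binary-compatibility concern can be made concrete by comparing the two layouts of the option structure. The sketch below is an illustration only: the `_old` and `_new` structure names and the `payload_accepted` helper are hypothetical, and the helper merely mirrors the shape of the length check visible in prio_print_opt() above. How a given kernel version treats a longer or shorter payload is not shown in this thread and is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

#define TC_PRIO_MAX 15

/* Layout before the patch. */
struct tc_prio_qopt_old {
    int bands;
    unsigned char priomap[TC_PRIO_MAX + 1];
};

/* Layout after the patch: one extra field appended at the end. */
struct tc_prio_qopt_new {
    int bands;
    unsigned char priomap[TC_PRIO_MAX + 1];
    unsigned short multiqueue;
};

/*
 * Same shape as the check in prio_print_opt(): a payload shorter than the
 * structure the reader was compiled against is rejected.
 */
static int payload_accepted(size_t payload_len, size_t expected_len)
{
    return payload_len >= expected_len;
}
```

Because the new structure is larger, a tc built with the new header would reject the shorter payload produced from the old layout, while a reader built against the old layout would still accept the longer one and ignore the tail. On common ABIs with a 4-byte int the sizes are typically 20 and 24 bytes.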
Re: [PATCH 2/7] Containers (V8): Cpusets hooked into containers
On 4/23/07, Vaidyanathan Srinivasan [EMAIL PROTECTED] wrote:

 config CONTAINERS
-	bool "Container support"
-	help
-	  This option will let you create and manage process containers,
-	  which can be used to aggregate multiple processes, e.g. for
-	  the purposes of resource tracking.
-
-	  Say N if unsure
+	bool

Hi Paul,

This looks like some patch generation error. Description for containers should not be removed after applying this patch.

No, this is intentional - in the first patch in the series, CONFIG_CONTAINERS was a user-selectable option so it had a description; in the second it becomes an option that's only selected if other selected systems (e.g. cpusets) depend on it. So it no longer needs help text.

Cheers,
Paul
Re: [ckrm-tech] [PATCH 0/7] Containers (V8): Generic Process Containers
On 4/23/07, Vaidyanathan Srinivasan [EMAIL PROTECTED] wrote:

Hi Paul,

In [patch 3/7] Containers (V8): Add generic multi-subsystem API to containers, you have forcefully enabled interrupts in container_init_subsys() with spin_unlock_irq(), which breaks on PPC64.

+static void container_init_subsys(struct container_subsys *ss) {
+	int retval;
+	struct list_head *l;
+	printk(KERN_ERR "Initializing container subsys %s\n", ss->name);
+
+	/* Create the top container state for this subsystem */
+	ss->root = &rootnode;
+	retval = ss->create(ss, dummytop);
+	BUG_ON(retval);
+	init_container_css(ss, dummytop);
+
+	/* Update all container groups to contain a subsys
+	 * pointer to this state - since the subsystem is
+	 * newly registered, all tasks and hence all container
+	 * groups are in the subsystem's top container. */
+	spin_lock_irq(&container_group_lock);
+	l = &init_container_group.list;
+	do {
+		struct container_group *cg =
+			list_entry(l, struct container_group, list);
+		cg->subsys[ss->subsys_id] = dummytop->subsys[ss->subsys_id];
+		l = l->next;
+	} while (l != &init_container_group.list);
+	spin_unlock_irq(&container_group_lock);

Interrupts get enabled here, and on PPC64 the kernel takes a pending decrementer interrupt and crashes because it is too early to handle it. Use of the irqsave and restore routines would fix the problem.

OK, thanks. I'll add that change.

Paul
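The difference between the two lock variants can be shown with a tiny user-space model of the interrupt-enable flag. This is a sketch, not kernel code: the `model_*` functions are hypothetical stand-ins that only track the flag and do no actual locking.

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled;  /* models the CPU's interrupt-enable state */

/* Model of spin_lock_irq(): always disables interrupts. */
static void model_lock_irq(void)
{
    irqs_enabled = false;
}

/* Model of spin_unlock_irq(): always enables interrupts, whatever came before. */
static void model_unlock_irq(void)
{
    irqs_enabled = true;
}

/* Model of spin_lock_irqsave(): remember the previous state, then disable. */
static unsigned long model_lock_irqsave(void)
{
    unsigned long flags = irqs_enabled ? 1UL : 0UL;

    irqs_enabled = false;
    return flags;
}

/* Model of spin_unlock_irqrestore(): put back whatever state was saved. */
static void model_unlock_irqrestore(unsigned long flags)
{
    irqs_enabled = (flags != 0);
}
```

Starting from the early-boot situation described above, where interrupts are still disabled, the irq pair leaves them enabled afterwards, while the irqsave/irqrestore pair leaves them disabled as it found them.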
SOME STUFF ABOUT REISER4 To Mr Hopper
Seems I did not answer the correct thread. On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper [EMAIL PROTECTED] said: I did. That whole thread is some guy spouting off a ludicrous Bonnie++ benchmark showing that compressing long strings of 0s results in things taking up very little space and being very fast. I think you are deliberately being stupid here. You are claiming that REISER4's good speed results when using compression actually has a simple explanation and THEREFORE all good results for the filesystem, even those results that have nothing to do with compression, are negated. NOTHING COULD BE FURTHER FROM THE TRUTH. Your conclusion is a total travesty of logic. As I understand it, the default Reiser4 DOES NOT USE any compression at all, not even tail compression, but saves space by eliminating block alignment wastage (tail compression is an option). So lets LOSE the statistics that involve compression. The results now look like this:

.-----------------------------.
| FILESYSTEM | TIME   | DISK  |
| TYPE       | (secs) | USAGE |
.-----------------------------.
| REISER4    |  3462  |  692  |
| EXT2       |  4092  |  816  |
| JFS        |  4225  |  806  |
| EXT4       |  4408  |  816  |
| EXT3       |  4421  |  816  |
| XFS        |  4625  |  779  |
| REISER3    |  6178  |  793  |
| FAT32      | 12342  |  988  |
| NTFS-3g    | 10414  |  772  |
.-----------------------------.

These results are still EXTREMELY GOOD for REISER4. These results still say that Reiser4 is a truly remarkable filesystem, as stated in: http://linuxhelp.150m.com/resources/fs-benchmarks.htm http://m.domaindlx.com/LinuxHelp/fs-benchmarks.htm So why do I see an anti-Reiser religion, in all that you people say. You, concentrate on the fact that bonnie++'s use of files that are mainly zeroes, will make the results using compression less good than they are. I can't see anywhere where this has been denied. In fact the other set of statistics that you just ignore, states that in more realistic situations, the compression speedup is slightly negative. What is wrong here, is: You say that the Bonnie++ tests using compression are subject to interpretation. No argument here.
You ignore the tests that confirm your statement. You are clearly not interested in the actual results or their interpretation. You, by some incredibly twisted logic then state that Reiser4 is therefore not good, even though it is clearly the best filesystem when NOT using compression. This of course is completely deceitful logic. That the speed advantage from compression would be small is clear from the OTHER data that you ignore, namely:

.--------------------------------------------------------------.
| File         | Disk  | Copy  | Copy  | Tar   | Unzip | Del   |
| System       | Usage | 655MB | 655MB | Gzip  | UnTar | 2.5   |
| Type         | (MB)  | (1)   | (2)   | 655MB | 655MB | Gig   |
.--------------------------------------------------------------.
| REISER4 lzo  |  278  |  138  |   56  |   80  |   34  |   84  |
| REISER4 gzip |  213  |  148  |   68  |   83  |   48  |   70  |
| REISER4      |  692  |  148  |   55  |   67  |   25  |   56  |
| EXT4         |  816  |  174  |   70  |   74  |   42  |   50  |
.--------------------------------------------------------------.

On Sun, 22 Apr 2007 19:00:46 -0700, Eric Hopper [EMAIL PROTECTED] said: I know that this whole effort has been put in disarray by the prosecution of Hans Reiser, but I'm curious as to its status. Is Reiser4 going to be going into the Linus kernel anytime soon? Is there somewhere I should be looking to find this out without wasting bandwidth here? There was a thread the other day, that talked about Reiser4. It took a while but I have found it (actually two) http://lkml.org/lkml/2007/4/5/360 http://lkml.org/lkml/2007/4/9/4 You may want to check them out. I did. That whole thread is some guy spouting off a ludicrous Bonnie++ benchmark showing that compressing long strings of 0s results in things taking up very little space and being very fast. Such things will produce lots of flames and no useful information whatsoever as is evinced by the half conspiracy theory, half truth the thread degenerated into in the second message you linked to.
-- [EMAIL PROTECTED]
Re: [PATCH] use mutex instead of semaphore in RocketPort driver
Hi Matthias, On 4/25/07, Robert Hancock [EMAIL PROTECTED] wrote: Matthias Kaehlcke wrote: On Tue, Apr 24, 2007 at 07:53:04PM +0200, Oliver Neukum wrote: On Tuesday, 24 April 2007 19:49, Matthias Kaehlcke wrote: @@ -1706,7 +1706,7 @@ static int rp_write(struct tty_struct *tty, if (count <= 0 || rocket_paranoia_check(info, "rp_write")) return 0; - down_interruptible(&info->write_sem); + mutex_lock_interruptible(&info->write_mtx); This is a bug. It is also present in the current code, but nevertheless it is a bug. If you use an interruptible lock, you must be ready to deal with interrupts, which are ignored by this code. [...] i'm a bit confused now about the interruptible locks, i thought using them means that the process will be woken up when receiving a signal. what role do interrupts play when using interruptible locks? You are correct, interrupts aren't involved. However if the wait is interrupted by a signal, mutex_lock_interruptible will return a nonzero return code which needs to be checked for (and likely -ERESTARTSYS or -EINTR returned), otherwise the code will blindly continue as though it has locked the mutex even though it has not. Think I'll elaborate Robert's explanation for your benefit :-) Unlike mutex_lock() and down() that put the task to TASK_UNINTERRUPTIBLE sleep if the lock can't be acquired immediately, mutex_lock_interruptible() and down_interruptible() sleep in TASK_INTERRUPTIBLE state. So the task _can_ be woken up (without even acquiring the lock) by incoming signals. When that happens, we can't just blindly go on ... so the return values of the _interruptible() versions of the locking functions *must* be checked for success and if not, the task should return with error. Use -ERESTARTSYS if a previous intermediate caller checks this return value and tries to restart the whole operation. If no such previous caller exists (and/or introducing it would involve a change in kernel behaviour as seen from userspace), you can safely use -EINTR.
The goal is that userspace must not get to see -ERESTARTSYS.
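The rule above (always check the result of an interruptible lock) can be sketched in plain userspace C. Everything here is a hypothetical model for illustration: struct model_mutex, model_mutex_lock_interruptible() and model_write() are not kernel or driver code, and -EINTR is used because -ERESTARTSYS is not defined in userspace headers.

```c
#include <errno.h>

/* Hypothetical stand-in for a kernel mutex plus a pending-signal flag. */
struct model_mutex {
	int locked;
	int signal_pending;
};

/*
 * Models mutex_lock_interruptible(): returns 0 with the lock held,
 * or a negative errno WITHOUT taking the lock if a signal arrived.
 */
static int model_mutex_lock_interruptible(struct model_mutex *m)
{
	if (m->signal_pending)
		return -EINTR;
	m->locked = 1;
	return 0;
}

static void model_mutex_unlock(struct model_mutex *m)
{
	m->locked = 0;
}

/* Write path that refuses to continue when the lock was not acquired. */
static int model_write(struct model_mutex *m, int count)
{
	int ret = model_mutex_lock_interruptible(m);

	if (ret)
		return ret;	/* a driver would choose -ERESTARTSYS or -EINTR */

	/* ... critical section would go here ... */

	model_mutex_unlock(m);
	return count;
}
```

The point of the sketch is only the early return: without it, the function would run its critical section and call the unlock routine on a lock it never held.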
2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- INFO: possible recursive locking detected
[ 59.677312] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory [ 59.688633] NFSD: starting 90-second grace period [ 60.221454] [ 60.221456] = [ 60.221461] [ INFO: possible recursive locking detected ] [ 60.221464] 2.6.21-rc7-mm1 #53 [ 60.221466] - [ 60.221469] S20powernowd/3584 is trying to acquire lock: [ 60.221472] (sd-s_active){}, at: [c01a2436] sysfs_hash_and_remove+0x91/0x10e [ 60.221486] [ 60.221487] but task is already holding lock: [ 60.221489] (sd-s_active){}, at: [c01a2a20] sysfs_write_file+0xb9/0x14a [ 60.221496] [ 60.221497] other info that might help us debug this: [ 60.221499] 4 locks held by S20powernowd/3584: [ 60.221501] #0: (sd-s_active){}, at: [c01a2a20] sysfs_write_file+0xb9/0x14a [ 60.221508] #1: (sd-s_active){}, at: [c01a2a32] sysfs_write_file+0xcb/0x14a [ 60.221515] #2: (per_cpu(cpu_policy_rwsem, cpu)){--..}, at: [c024081b] lock_policy_rwsem_write+0x20/0x37 [ 60.221524] #3: (userspace_mutex){--..}, at: [c0299dfe] mutex_lock+0x1f/0x23 [ 60.221534] [ 60.221535] stack backtrace: [ 60.221538] [c0104e0f] show_trace_log_lvl+0x1a/0x30 [ 60.221543] [c0105a26] show_trace+0x12/0x14 [ 60.221547] [c0105ab3] dump_stack+0x16/0x18 [ 60.221551] [c0134d63] __lock_acquire+0x12e/0xb4c [ 60.221557] [c01357e9] lock_acquire+0x68/0x82 [ 60.221561] [c012ddda] down_write+0x3a/0x53 [ 60.221567] [c01a2436] sysfs_hash_and_remove+0x91/0x10e [ 60.221571] [c01a2bb0] sysfs_remove_file+0x10/0x12 [ 60.221575] [c0241756] cpufreq_governor_userspace+0x10c/0x1dc [ 60.221579] [c023fd2b] __cpufreq_governor+0x9c/0xd0 [ 60.221583] [c023fed0] __cpufreq_set_policy+0x171/0x209 [ 60.221587] [c02400b5] store_scaling_governor+0x14d/0x184 [ 60.221591] [c0240bee] store+0x3e/0x60 [ 60.221594] [c01a2a85] sysfs_write_file+0x11e/0x14a [ 60.221599] [c01699fb] vfs_write+0x90/0x119 [ 60.221605] [c0169eef] sys_write+0x3d/0x61 [ 60.221609] [c0103e66] sysenter_past_esp+0x5f/0x99 [ 60.221613] === [ 60.763809] Clocksource tsc unstable (delta = -75646443 ns)
Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- INFO: possible recursive locking detected
Miles Lane wrote: [ 59.677312] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory [ 59.688633] NFSD: starting 90-second grace period [ 60.221454] [ 60.221456] = [ 60.221461] [ INFO: possible recursive locking detected ] [ 60.221464] 2.6.21-rc7-mm1 #53 [ 60.221466] - [ 60.221469] S20powernowd/3584 is trying to acquire lock: [ 60.221472] (sd-s_active){}, at: [c01a2436] sysfs_hash_and_remove+0x91/0x10e [ 60.221486] [ 60.221487] but task is already holding lock: [ 60.221489] (sd-s_active){}, at: [c01a2a20] sysfs_write_file+0xb9/0x14a [ 60.221496] [ 60.221497] other info that might help us debug this: [ 60.221499] 4 locks held by S20powernowd/3584: [ 60.221501] #0: (sd-s_active){}, at: [c01a2a20] sysfs_write_file+0xb9/0x14a [ 60.221508] #1: (sd-s_active){}, at: [c01a2a32] sysfs_write_file+0xcb/0x14a [ 60.221515] #2: (per_cpu(cpu_policy_rwsem, cpu)){--..}, at: [c024081b] lock_policy_rwsem_write+0x20/0x37 [ 60.221524] #3: (userspace_mutex){--..}, at: [c0299dfe] mutex_lock+0x1f/0x23 Thanks for reporting. We need to separate s_active users into two classes - one for r/w the other for deleting for nodes which delete other nodes when written to. Will post a patch soon. -- tejun
2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)
[ 1251.506964] PM: Preparing system for mem sleep [ 1251.514790] Stopping tasks ... [ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze): [ 1271.456243] multiload-apple [ 1271.456291] Restarting tasks ... done. This isn't happening under earlier builds I've tested. How can I debug this? Thanks, Miles
[PATCH] drivers/net: move the nvidia forcedeth driver from 100M group to 1000M group
nForce Ethernet is a Gigabit NIC, not 100M; move it to the 1000M group to avoid confusion. Signed-off-by: Peer Chen [EMAIL PROTECTED] --- linux-2.6.21-rc7/drivers/net/Kconfig.orig +++ linux-2.6.21-rc7/drivers/net/Kconfig @@ -1399,35 +1399,6 @@ config B44 file:Documentation/networking/net-modules.txt. The module will be called b44. -config FORCEDETH - tristate "nForce Ethernet support" - depends on NET_PCI && PCI - help - If you have a network (Ethernet) controller of this type, say Y and - read the Ethernet-HOWTO, available from - http://www.tldp.org/docs.html#howto. - - To compile this driver as a module, choose M here and read - file:Documentation/networking/net-modules.txt. The module will be - called forcedeth. - -config FORCEDETH_NAPI - bool "Use Rx and Tx Polling (NAPI) (EXPERIMENTAL)" - depends on FORCEDETH && EXPERIMENTAL - help - NAPI is a new driver API designed to reduce CPU and interrupt load - when the driver is receiving lots of packets from the card. It is - still somewhat experimental and thus not yet enabled by default. - - If your estimated Rx load is 10kpps or more, or if the card will be - deployed on potentially unfriendly networks (e.g. in a firewall), - then say Y here. - - See file:Documentation/networking/NAPI_HOWTO.txt for more - information. - - If in doubt, say N. - config CS89x0 tristate "CS89x0 support" depends on NET_PCI && (ISA || MACH_IXDP2351 || ARCH_IXDP2X01 || ARCH_PNX010X) @@ -1999,6 +1970,35 @@ config MYRI_SBUS To compile this driver as a module, choose M here: the module will be called myri_sbus. This is recommended. +config FORCEDETH + tristate "nForce Ethernet support" + depends on NET_PCI && PCI + help + If you have a network (Ethernet) controller of this type, say Y and + read the Ethernet-HOWTO, available from + http://www.tldp.org/docs.html#howto. + + To compile this driver as a module, choose M here and read + file:Documentation/networking/net-modules.txt. The module will be + called forcedeth.
+ +config FORCEDETH_NAPI + bool "Use Rx and Tx Polling (NAPI) (EXPERIMENTAL)" + depends on FORCEDETH && EXPERIMENTAL + help + NAPI is a new driver API designed to reduce CPU and interrupt load + when the driver is receiving lots of packets from the card. It is + still somewhat experimental and thus not yet enabled by default. + + If your estimated Rx load is 10kpps or more, or if the card will be + deployed on potentially unfriendly networks (e.g. in a firewall), + then say Y here. + + See file:Documentation/networking/NAPI_HOWTO.txt for more + information. + + If in doubt, say N. + config NS83820 tristate "National Semiconductor DP83820 support" depends on PCI
Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)
On Tue, 24 Apr 2007 22:27:44 -0700 Miles Lane [EMAIL PROTECTED] wrote: [ 1251.506964] PM: Preparing system for mem sleep [ 1251.514790] Stopping tasks ... [ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze): [ 1271.456243] multiload-apple [ 1271.456291] Restarting tasks ... done. This isn't happening under earlier builds I've tested. How can I debug this? hm, that's multiload-applet, some gnome thing. sysrq-T, perhaps? Perhaps the process is sleeping in the kernel somewhere.
Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)
On 4/24/07, Andrew Morton [EMAIL PROTECTED] wrote: On Tue, 24 Apr 2007 22:27:44 -0700 Miles Lane [EMAIL PROTECTED] wrote: [ 1251.506964] PM: Preparing system for mem sleep [ 1251.514790] Stopping tasks ... [ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze): [ 1271.456243] multiload-apple [ 1271.456291] Restarting tasks ... done. This isn't happening under earlier builds I've tested. How can I debug this? hm, that's multiload-applet, some gnome thing. sysrq-T, perhaps? Perhaps the process is sleeping in the kernel somewhere. Should I wait for the next patch from Tejun before retesting? Perhaps this suspend problem is a side effect of the locking problem he mentioned. Miles
Re: 2.6.21-rc7-mm1 + sysfs-oops-workaround.patch -- software suspend failed (1 tasks refusing to freeze)
On Tue, 24 Apr 2007 22:49:48 -0700 Miles Lane [EMAIL PROTECTED] wrote: On 4/24/07, Andrew Morton [EMAIL PROTECTED] wrote: On Tue, 24 Apr 2007 22:27:44 -0700 Miles Lane [EMAIL PROTECTED] wrote: [ 1251.506964] PM: Preparing system for mem sleep [ 1251.514790] Stopping tasks ... [ 1271.456065] Stopping user space processes timed out after 20 seconds (1 tasks refusing to freeze): [ 1271.456243] multiload-apple [ 1271.456291] Restarting tasks ... done. This isn't happening under earlier builds I've tested. How can I debug this? hm, that's multiload-applet, some gnome thing. sysrq-T, perhaps? Perhaps the process is sleeping in the kernel somewhere. Should I wait for the next patch from Tejun before retesting? Perhaps this suspend problem is a side effect of the locking problem he mentioned. It's unlikely to be related to Tejun's sysfs changes.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
> Since we need to have some way to track them having an explicit data > structure that the callers manage seems to make sense. Oh sure, I wasn't arguing against that at all... It might be handy to have a release() callback (optional) that gets called after the kthread stops/exits, once we know the data structure isn't going to be used anymore (if practical to implement, depends on your approach). Ben.
Re: [patch] oom: kill all threads that share mm with killed task
On Mon, 23 Apr 2007, Christoph Lameter wrote: > Obvious fix. It was broken by > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584 > Dec 7. So it's in 2.6.20 and later. Candidate for stable? > I agree it's obvious enough that it should be included in stable. Otherwise the entire iteration becomes a big no-op and it won't alleviate the OOM condition in one call to out_of_memory() because there may be outstanding tasks with the shared ->mm. David
Re: [PATCH] Transparently handle <.symbol> lookup for kprobes
Srinivasa Ds writes: > + } else {\ > + char dot_name[KSYM_NAME_LEN+1]; \ > + dot_name[0] = '.'; \ > + dot_name[1] = '\0'; \ > + strncat(dot_name, name, KSYM_NAME_LEN); \ Assuming the kernel strncat works like the userspace one does, there is a possibility that dot_name[] won't be properly null-terminated here. If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch dot_name[KSYM_NAME_LEN]. Paul.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
Christoph Hellwig writes: > The first question is obviously, is this really something we want? > spawning kernel thread on demand without reaping them properly seems > quite dangerous. What specifically has to be done to reap a kernel thread? Are you concerned about the number of threads, or about having zombies hanging around? Paul.
SOME STUFF ABOUT REISER4
On Sun, 22 Apr 2007 19:00:46 -0700, "Eric Hopper" <[EMAIL PROTECTED]> said: > I know that this whole effort has been put in disarray by the > prosecution of Hans Reiser, but I'm curious as to its status. Is > Reiser4 going to be going into the Linus kernel anytime soon? Is there > somewhere I should be looking to find this out without wasting bandwidth > here? There was a thread the other day, that talked about Reiser4. It took a while but I have found it (actually two) http://lkml.org/lkml/2007/4/5/360 http://lkml.org/lkml/2007/4/9/4 You may want to check them out. -- [EMAIL PROTECTED]
Re: [PATCH 03/25] xen: Add nosegneg capability to the vsyscall page notes
Roland McGrath wrote: >> I have to admit I still don't really understand all this. Is it >> documented somewhere? >> > > I have explained it in public more than once, but I don't know off hand > anywhere that was helpfully recorded. > Thanks very much. I'd been poking about, but the closest I came to an actual description was various patches fixing bugs, so it was a little incomplete. > For example, a Xen-enabled kernel can use a single vDSO image (or a single > pair of int80/sysenter images), containing the "nosegneg" hwcap note. When > there is no need for it (native or hvm or 64-bit hv or whatever), it just > clears the mask word. If you actually do this, you'll want to modify the > NOTE_KERNELCAP_BEGIN macro to define a global label you can use with VDSO_SYM. > Thanks for the pointer. I'd been getting a bit of heat for enabling the nosegneg flag unconditionally. If I can make it Xen-specific then that will be one less source of complaints. J
Re: [REPORT] cfs-v4 vs sd-0.44
Arjan van de Ven wrote: Within reason, it's not the number of clients that X has that causes its CPU bandwidth use to sky rocket and cause problems. It's more to do with what type of clients they are. Most GUIs (even ones that are constantly updating visual data (e.g. gkrellm -- I can open quite a large number of these without increasing X's CPU usage very much)) cause very little load on the X server. The exceptions to this are the there is actually 2 and not just 1 "X server", and they are VERY VERY different in behavior. Case 1: Accelerated driver If X talks to a decent enough card it supports well with acceleration, it will be very rare for X itself to spend any kind of significant amount of CPU time, all the really heavy stuff is done in hardware, and asynchronously at that. A bit of batching will greatly improve system performance in this case. Case 2: Unaccelerated VESA Some drivers in X, especially the VESA and NV drivers (which are quite common, vesa is used on all hardware without a special driver nowadays), have no or not enough acceleration to matter for modern desktops. This means the CPU is doing all the heavy lifting, in the X program. In this case even a simple "move the window a bit" becomes quite a bit of a CPU hog already. Mine's a: SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according to X's display settings tool. Which category does that fall into? It's not a special adapter and is just the one that came with the motherboard. It doesn't use much CPU unless I grab a window and wiggle it all over the screen or do something like "ls -lR /" in an xterm. The cases are fundamentally different in behavior, because in the first case, X hardly consumes the time it would get in any scheme, while in the second case X really is CPU bound and will happily consume any CPU time it can get. Which still doesn't justify an elaborate "points" sharing scheme.
Whichever way you look at it, that's just another way of giving X more CPU bandwidth and there are simpler ways to give X more CPU if it needs it. However, I think there's something seriously wrong if it needs the -19 nice that I've heard mentioned. You might as well just run it as a real time process. Peter -- Peter Williams [EMAIL PROTECTED] "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
NonExecutable Bit in 32Bit
Hey, is it right that the NX bit is not used on the i386 arch but is used on x86_64? If so, is there a particular reason for it not to be used? Ciao Thilo
Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.
On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote: > > Currently because vmlinux does not reflect that the kernel is relocatable > we still have to support CONFIG_PHYSICAL_START. So this patch adds a small > c program to do what we cannot do with a linker script, set the elf header > type to ET_DYN. > > This should remove the last obstacle to removing CONFIG_PHYSICAL_START > on x86_64. > > Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]> [Dropping fastboot mailing list from CC as kexec mailing list is new list for this discussion] [..] > +void file_open(const char *name) > +{ > + if ((fd = open(name, O_RDWR, 0)) < 0) > + die("Unable to open `%s': %m", name); > +} > + > +static void mketrel(void) > +{ > + unsigned char e_type[2]; > + if (read(fd, &e_ident, sizeof(e_ident)) != sizeof(e_ident)) > + die("Cannot read ELF header: %s\n", strerror(errno)); > + > + if (memcmp(e_ident, ELFMAG, 4) != 0) > + die("No ELF magic\n"); > + > + if ((e_ident[EI_CLASS] != ELFCLASS64) && > + (e_ident[EI_CLASS] != ELFCLASS32)) > + die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]); > + > + if ((e_ident[EI_DATA] != ELFDATA2LSB) && > + (e_ident[EI_DATA] != ELFDATA2MSB)) > + die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]); > + > + if (e_ident[EI_VERSION] != EV_CURRENT) > + die("Unknown ELF version: %d\n", e_ident[EI_VERSION]); > + > + if (e_ident[EI_DATA] == ELFDATA2LSB) { > + e_type[0] = ET_REL & 0xff; > + e_type[1] = ET_REL >> 8; > + } else { > + e_type[1] = ET_REL & 0xff; > + e_type[0] = ET_REL >> 8; > + } Hi Eric, Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux as it does not find it to be executable type. I am not well versed with various conventions but if I go through "Executable and Linking Format" document, this is what it says about various file types. • A relocatable file holds code and data suitable for linking with other object files to create an executable or a shared object file.
• An executable file holds a program suitable for execution. • A shared object file holds code and data suitable for linking in two contexts. First, the link editor may process it with other relocatable and shared object files to create another object file. Second, the dynamic linker combines it with an executable file and other shared objects to create a process image. So above does not seem to fit in the ET_REL type. We can't relink this vmlinux? And it does not seem to fit in ET_DYN definition too. We are not relinking this vmlinux with another executable or other relocatable files. I remember once you mentioned the term dynamic executable which can be loaded at a non-compiled address and let run without requiring any relocation processing. This vmlinux will fall in that category but can't relate it to standard elf file definitions. Thanks Vivek
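The byte-order handling in the quoted patch can be sketched in isolation. This is a minimal illustration, not the patch itself: sk_set_e_type() and the SK_* names are hypothetical, the numeric values are the ones the ELF specification assigns (ET_REL = 1, ET_EXEC = 2, ET_DYN = 3, e_type at byte offset 16), and the type is passed as a parameter so the sketch takes no side in the ET_REL versus ET_DYN question.

```c
#include <stdint.h>

/* Local names for values defined by the ELF specification. */
enum { SK_EI_DATA = 5, SK_ELFDATA2LSB = 1, SK_ELFDATA2MSB = 2 };
enum { SK_ET_REL = 1, SK_ET_EXEC = 2, SK_ET_DYN = 3 };

/* e_type is the 16-bit field right after the 16-byte e_ident array. */
#define SK_E_TYPE_OFFSET 16

/*
 * Store 'type' into the e_type field of an in-memory ELF header,
 * honouring the byte order announced in e_ident[EI_DATA].
 * Returns 0 on success, -1 if the data encoding is not recognized.
 */
static int sk_set_e_type(unsigned char *hdr, uint16_t type)
{
	unsigned char lo = type & 0xff;
	unsigned char hi = type >> 8;

	if (hdr[SK_EI_DATA] == SK_ELFDATA2LSB) {
		hdr[SK_E_TYPE_OFFSET] = lo;
		hdr[SK_E_TYPE_OFFSET + 1] = hi;
	} else if (hdr[SK_EI_DATA] == SK_ELFDATA2MSB) {
		hdr[SK_E_TYPE_OFFSET] = hi;
		hdr[SK_E_TYPE_OFFSET + 1] = lo;
	} else {
		return -1;
	}
	return 0;
}
```

A caller would read the first 18 bytes of the file, call sk_set_e_type(hdr, SK_ET_DYN) or sk_set_e_type(hdr, SK_ET_REL), and write the two bytes at offset 16 back.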
Re: [REPORT] cfs-v4 vs sd-0.44
* Peter Williams <[EMAIL PROTECTED]> wrote: > > The cases are fundamentally different in behavior, because in the > > first case, X hardly consumes the time it would get in any scheme, > > while in the second case X really is CPU bound and will happily > > consume any CPU time it can get. > > Which still doesn't justify an elaborate "points" sharing scheme. > Whichever way you look at that that's just another way of giving X > more CPU bandwidth and there are simpler ways to give X more CPU if it > needs it. However, I think there's something seriously wrong if it > needs the -19 nice that I've heard mentioned. Gene has done some testing under CFS with X reniced to +10 and the desktop still worked smoothly for him. So CFS does not 'need' a reniced X. There are simply advantages to negative nice levels: for example screen refreshes are smoother on any scheduler i tried. BUT, there is a caveat: on non-CFS schedulers i tried X is much more prone to get into 'overscheduling' scenarios that visibly hurt X's performance, while on CFS there's a max of 1000-1500 context switches a second at nice -10. (which, considering the cost of a context switch is well under 1% overhead.) So, my point is, the nice level of X for desktop users should not be set lower than a low limit suggested by that particular scheduler's author. That limit is scheduler-specific. Con i think recommends a nice level of -1 for X when using SD [Con, can you confirm?], while my tests show that if you want you can go as low as -10 under CFS, without any bad side-effects. (-19 was a bit too much) > [...] You might as well just run it as a real time process. hm, that would be a bad idea under any scheduler (including CFS), because real time processes can starve other processes indefinitely. 
Ingo
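The "well under 1% overhead" estimate for 1000-1500 context switches a second can be sanity-checked with simple arithmetic. The sketch below is only illustrative: ctxsw_overhead_percent() is a hypothetical helper, and the per-switch cost passed to it is an assumption, since the thread does not state one.

```c
/*
 * Share of one CPU spent on context switching, in percent.
 * cost_usec is the assumed cost of a single switch in microseconds;
 * it is an input to the estimate, not a measured value.
 */
static double ctxsw_overhead_percent(double switches_per_sec, double cost_usec)
{
	double busy_usec_per_sec = switches_per_sec * cost_usec;

	return busy_usec_per_sec / 1e6 * 100.0;
}
```

With an assumed cost of 2 microseconds, ctxsw_overhead_percent(1500.0, 2.0) comes to about 0.3%. More generally, 1500 switches a second stay below 1% as long as one switch costs less than roughly 6.7 microseconds.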
Re: NonExecutable Bit in 32Bit
On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote: Hey, is it right that the NX bit is not used on the i386 arch but is used on x86_64? If so, is there a particular reason for it not to be used? Ciao Thilo I don't think so - some i386 cpus definitely have support for the NX bit. Would having this be supported in i386 help debugging (and security) significantly? William Heimbigner [EMAIL PROTECTED]
Re: [patch v2] Fixes and cleanups for earlyprintk aka boot console.
On Thu, 15 Mar 2007 16:46:39 +0100 Gerd Hoffmann <[EMAIL PROTECTED]> wrote: > The console subsystem already has an idea of a boot console, using the > CON_BOOT flag. The implementation has some flaws though. The major > problem is that presence of a boot console makes register_console() > ignore any other console devices (unless explicitly specified on the > kernel command line). > > This patch fixes the console selection code to *not* consider a boot > console a full-featured one, so the first non-boot console registering > will become the default console instead. This way the unregister call > for the boot console in the register_console() function actually > triggers and the handover from the boot console to the real console > device works smoothly. Added a printk for the handover, so you know > which console device the output goes to when the boot console stops > printing messages. > > The disable_early_printk() call is obsolete with that patch, explicitly > disabling the early console isn't needed any more as it works > automagically with that patch. > > I've walked through the tree, dropped all disable_early_printk() > instances found below arch/ and tagged the consoles with CON_BOOT if > needed. The code is tested on x86, sh (thanks to Paul) and mips > (thanks to Ralf). > > Changes to last version: Rediffed against -rc3, adapted to mips > cleanups by Ralf, fixed "udbg-immortal" cmd line arg on powerpc. I get this, across netconsole: [17179569.184000] console handover: boot [earlyvga_f_0] -> real [tty0] wanna take a look at why there's cruft in bootconsole->name please? in grub.conf I have kernel /boot/bzImage-2.6.21-rc7-mm1 ro root=LABEL=/ rhgb vga=0x263 [EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:0D:56:C6:C6:CC profile=1 earlyprintk=vga resume=8:5 time and I'm using http://userweb.kernel.org/~akpm/config-sony.txt Thanks. 
Re: [PATCH] Transparently handle <.symbol> lookup for kprobes
Paul Mackerras wrote:
> Srinivasa Ds writes:
>
>> +} else {\
>> +char dot_name[KSYM_NAME_LEN+1]; \
>> +dot_name[0] = '.'; \
>> +dot_name[1] = '\0'; \
>> +strncat(dot_name, name, KSYM_NAME_LEN); \
>
> Assuming the kernel strncat works like the userspace one does, there
> is a possibility that dot_name[] won't be properly null-terminated
> here. If strlen(name) >= KSYM_NAME_LEN-1, then strncat will set
> dot_name[KSYM_NAME_LEN-1] to something non-null and won't touch
> dot_name[KSYM_NAME_LEN].

Irrespective of the length of the string, the kernel implementation of strncat (lib/string.c) ensures that the last character of the string is set to null. So dot_name[] is always null-terminated.

char *strncat(char *dest, const char *src, size_t count)
{
    char *tmp = dest;

    if (count) {
        while (*dest)
            dest++;
        while ((*dest++ = *src++) != 0) {
            if (--count == 0) {
                *dest = '\0';
                break;
            }
        }
    }
    return tmp;
}
EXPORT_SYMBOL(strncat);

===

Is this OK then?

Thanks
Srinivasa DS

>
> Paul.
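A small user-space sketch may help when checking the bound in this exchange. It copies the logic of the strncat quoted above under a different name and uses a tiny NAME_LEN as a hypothetical stand-in for KSYM_NAME_LEN; the helper dot_name_overflows() and the canary byte are illustration only, not part of the patch. Tracing it by hand suggests that termination is indeed guaranteed, but that with count equal to KSYM_NAME_LEN and a one-byte prefix already in the buffer, the terminator for a long enough name lands one byte past the KSYM_NAME_LEN+1 bytes allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 8  /* hypothetical stand-in for KSYM_NAME_LEN */

/* Same logic as the lib/string.c strncat quoted in the mail, renamed
 * so that it does not clash with the libc function. */
static char *k_strncat(char *dest, const char *src, size_t count)
{
    char *tmp = dest;

    if (count) {
        while (*dest)
            dest++;
        while ((*dest++ = *src++) != 0) {
            if (--count == 0) {
                *dest = '\0';
                break;
            }
        }
    }
    return tmp;
}

/*
 * Build ".name" the way the patch does, but in a buffer that has one
 * extra canary byte after the NAME_LEN + 1 bytes the patch allocates.
 * Returns 1 if the canary byte was overwritten, 0 otherwise.
 */
int dot_name_overflows(const char *name, size_t count)
{
    char buf[NAME_LEN + 2];

    assert(name != NULL);
    memset(buf, 'X', sizeof(buf));
    buf[0] = '.';
    buf[1] = '\0';
    k_strncat(buf, name, count);
    return buf[NAME_LEN + 1] != 'X';
}
```

Under these assumptions, passing NAME_LEN - 1 as the count keeps the terminator inside the original NAME_LEN + 1 bytes; whether a symbol name can actually be that long in practice is a separate question for the patch author.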
Re: [patch 1/4] Ignore stolen time in the softlockup watchdog
On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote: > The softlockup watchdog is currently a nuisance in a virtual machine, > since the whole system could have the CPU stolen from it for a long > period of time. While it would be unlikely for a guest domain to be > denied timer interrupts for over 10s, it could happen and any softlockup > message would be completely spurious. > > Earlier I proposed that sched_clock() return time in unstolen > nanoseconds, which is how Xen and VMI currently implement it. If the > softlockup watchdog uses sched_clock() to measure time, it would > automatically ignore stolen time, and therefore only report when the > guest itself locked up. When running native, sched_clock() returns > real-time nanoseconds, so the behaviour would be unchanged. > > Note that sched_clock() used this way is inherently per-cpu, so this > patch makes sure that the per-processor watchdog thread initialized > its own timestamp. This patch (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch) causes six failures in the locking self-tests, which I must say is rather clever of it. Here's the first one: [17179569.184000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar [17179569.184000] ... MAX_LOCKDEP_SUBCLASSES:8 [17179569.184000] ... MAX_LOCK_DEPTH: 30 [17179569.184000] ... MAX_LOCKDEP_KEYS:2048 [17179569.184000] ... CLASSHASH_SIZE: 1024 [17179569.184000] ... MAX_LOCKDEP_ENTRIES: 8192 [17179569.184000] ... MAX_LOCKDEP_CHAINS: 16384 [17179569.184000] ... 
CHAINHASH_SIZE: 8192 [17179569.184000] memory used by lock dependency info: 992 kB [17179569.184000] per task-struct memory footprint: 1200 bytes [17179569.184000] [17179569.184000] | Locking API testsuite: [17179569.184000] [17179569.184000] | spin |wlock |rlock |mutex | wsem | rsem | [17179569.184000] -- [17179569.184000] A-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184000] A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184000] A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184001] A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | [17179569.184002] A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184003] A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184004] A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | [17179569.184005] double unlock: ok | ok | ok | ok | ok | ok | [17179569.184006] initialize held: ok | ok | ok | ok | ok | ok | [17179569.184006] bad unlock order: ok | ok | ok | ok | ok | ok | [17179569.184006] -- [17179569.184006] recursive read-lock: | ok | | ok | [17179569.184006]recursive read-lock #2: | ok | | ok | [17179569.184007] mixed read-write-lock: | ok | | ok | [17179569.184007] mixed write-read-lock: | ok | | ok | [17179569.184007] -- [17179569.184007] hard-irqs-on + irq-safe-A/12: ok | ok | ok | [17179569.184007] soft-irqs-on + irq-safe-A/12: ok | ok | ok | [17179569.184007] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [17179569.184007] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [17179569.184007]sirq-safe-A => hirqs-on/12: ok | ok |irq event stamp: 458 [17179569.184007] hardirqs last enabled at (458): [] irqsafe2A_rlock_12+0x96/0xa3 [17179569.184007] hardirqs last disabled at (457): [] sched_clock+0x5e/0xe9 [17179569.184007] softirqs last enabled at (454): [] irqsafe2A_rlock_12+0x81/0xa3 [17179569.184007] softirqs last disabled at (450): [] irqsafe2A_rlock_12+0xb/0xa3 [17179569.184007] FAILED| [] dump_trace+0x63/0x1ec [17179569.184007] [] 
show_trace_log_lvl+0x1a/0x30 [17179569.184007] [] show_trace+0x12/0x14 [17179569.184007] [] dump_stack+0x16/0x18 [17179569.184007] [] dotest+0x6b/0x3d0 [17179569.184007] [] locking_selftest+0x915/0x1a58 [17179569.184007] [] start_kernel+0x1d0/0x2a2 [17179569.184007] === [17179569.184007] [17179569.184007]sirq-safe-A => hirqs-on/21:irq event stamp: 462 [17179569.184007] hardirqs last enabled at (462): []
Re: [REPORT] First "glitch1" results, 2.6.21-rc7-git6-CFSv5 + SD 0.46
* Ed Tomlinson <[EMAIL PROTECTED]> wrote:

> > SD 0.46             1-2 FPS
> > cfs v5 nice -19     219-233 FPS
> > cfs v5 nice 0       1000-1996
>   cfs v5 nice -10     60-65 FPS

the problem is, the glxgears portion of this test is an _inverse_ testcase.

The reason? glxgears on true 3D hardware will _not_ use X, it will directly use the 3D driver of the kernel. So by renicing X to -19 you give the xterms more chance to show stuff - the performance of the glxgears will 'degrade' - but that is what you asked for: glxgears is 'just another CPU hog' that competes with X, it's not a "true" X client.

if you are after glxgears performance in this test then you'll get the best performance out of this by renicing X to +19 or even SCHED_BATCH.

    Ingo
Re: [patch 1/4] Ignore stolen time in the softlockup watchdog
Andrew Morton wrote:
> On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
>
>> The softlockup watchdog is currently a nuisance in a virtual machine,
>> since the whole system could have the CPU stolen from it for a long
>> period of time. While it would be unlikely for a guest domain to be
>> denied timer interrupts for over 10s, it could happen and any softlockup
>> message would be completely spurious.
>>
>> Earlier I proposed that sched_clock() return time in unstolen
>> nanoseconds, which is how Xen and VMI currently implement it. If the
>> softlockup watchdog uses sched_clock() to measure time, it would
>> automatically ignore stolen time, and therefore only report when the
>> guest itself locked up. When running native, sched_clock() returns
>> real-time nanoseconds, so the behaviour would be unchanged.
>>
>> Note that sched_clock() used this way is inherently per-cpu, so this
>> patch makes sure that the per-processor watchdog thread initialized
>> its own timestamp.
>
> This patch
> (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch)
> causes six failures in the locking self-tests, which I must say is rather
> clever of it.

Interesting. Which variation of sched_clock do you have in your tree at the moment?

    J
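The idea in the quoted patch description can be sketched in user space as follows. This is only an illustration under stated assumptions: struct cpu_clock, unstolen_ns() and softlockup_detected() are hypothetical names, and the 10 second threshold is taken from the "over 10s" figure in the description rather than from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 10 s threshold, taken from the figure mentioned in the description. */
#define SOFTLOCKUP_THRESH_NS (10ULL * 1000000000ULL)

/* Hypothetical per-CPU clock state: elapsed wall time and the part of
 * it that the hypervisor took away from this guest CPU. */
struct cpu_clock {
    uint64_t real_ns;
    uint64_t stolen_ns;
};

/* Time the guest actually ran: the quantity the description wants
 * sched_clock() to return. Natively stolen_ns is 0, so this is real time. */
uint64_t unstolen_ns(const struct cpu_clock *c)
{
    assert(c->stolen_ns <= c->real_ns);
    return c->real_ns - c->stolen_ns;
}

/* Report a lockup only if the guest itself ran for longer than the
 * threshold since the watchdog timestamp was last touched. */
bool softlockup_detected(const struct cpu_clock *now,
                         uint64_t last_touch_unstolen_ns)
{
    return unstolen_ns(now) - last_touch_unstolen_ns > SOFTLOCKUP_THRESH_NS;
}
```

Because each CPU has its own stolen time, the timestamp compared against has to come from the same CPU, which matches the per-cpu remark in the description.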
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote: >* Peter Williams <[EMAIL PROTECTED]> wrote: >> > The cases are fundamentally different in behavior, because in the >> > first case, X hardly consumes the time it would get in any scheme, >> > while in the second case X really is CPU bound and will happily >> > consume any CPU time it can get. >> >> Which still doesn't justify an elaborate "points" sharing scheme. >> Whichever way you look at that that's just another way of giving X >> more CPU bandwidth and there are simpler ways to give X more CPU if it >> needs it. However, I think there's something seriously wrong if it >> needs the -19 nice that I've heard mentioned. > >Gene has done some testing under CFS with X reniced to +10 and the >desktop still worked smoothly for him. As a data point here, and probably nothing to do with X, but I did manage to lock it up, solid, reset button time tonight, by wanting 'smart' to get done with an update session after amanda had started. I took both smart processes I could see in htop all the way to -19, but when it was about done about 3 minutes later, everything came to an instant, frozen, reset button required lockup. I should have stopped at -17 I guess. :( >So CFS does not 'need' a reniced >X. There are simply advantages to negative nice levels: for example >screen refreshes are smoother on any scheduler i tried. BUT, there is a >caveat: on non-CFS schedulers i tried X is much more prone to get into >'overscheduling' scenarios that visibly hurt X's performance, while on >CFS there's a max of 1000-1500 context switches a second at nice -10. >(which, considering the cost of a context switch is well under 1% >overhead.) > >So, my point is, the nice level of X for desktop users should not be set >lower than a low limit suggested by that particular scheduler's author. >That limit is scheduler-specific. 
Con i think recommends a nice level of
>-1 for X when using SD [Con, can you confirm?], while my tests show that
>if you want you can go as low as -10 under CFS, without any bad
>side-effects. (-19 was a bit too much)
>
>> [...] You might as well just run it as a real time process.
>
>hm, that would be a bad idea under any scheduler (including CFS),
>because real time processes can starve other processes indefinitely.
>
> Ingo

--
Cheers, Gene
"There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I have discovered that all human evil comes from this, man's being unable to sit still in a room.
    -- Blaise Pascal
RE: NonExecutable Bit in 32Bit
> I don't think so - some i386 cpus definitely have support for the NX bit.

Ok, the CPUs do support it, but the kernel doesn't use it even when it is active in the BIOS.

> Would having this be supported in i386 help debugging (and security)
> significantly?

@William: I don't understand this question :(

Ciao Thilo
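As background for this question, and not something stated in the thread: on 32-bit x86 the NX bit is bit 63 of a page-table entry, so it only exists in the 64-bit PAE entry format; a non-PAE i386 kernel uses 32-bit entries and has nowhere to store it. Assuming that is the situation here, one way to see what a machine offers is to look for the nx and pae tokens on the flags line of /proc/cpuinfo. The helper names below are hypothetical and the sketch only parses a flags string that the caller supplies.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if `flag` appears as a whole whitespace-separated token
 * in `flags` (for example the "flags" line of /proc/cpuinfo). */
bool has_cpu_flag(const char *flags, const char *flag)
{
    size_t len;
    const char *p = flags;

    assert(flags != NULL && flag != NULL);
    len = strlen(flag);
    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return true;
        p += len;
    }
    return false;
}

/* Assumption of this sketch: a 32-bit kernel can only use NX when the
 * CPU reports both NX and PAE (and the kernel is built with PAE). */
bool nx_usable_on_32bit(const char *flags)
{
    return has_cpu_flag(flags, "nx") && has_cpu_flag(flags, "pae");
}
```

Even when both flags are present, the running kernel still has to be configured for PAE paging, which this string check cannot tell you.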
Re: [REPORT] cfs-v4 vs sd-0.44
Ingo Molnar wrote:

> static void
> yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
> {
>     struct rb_node *curr, *next, *first;
>     struct task_struct *p_next;
>
>     /*
>      * yield-to support: if we are on the same runqueue then
>      * give half of our wait_runtime (if it's positive) to the other task:
>      */
>     if (p_to && p->wait_runtime > 0) {
>         p->wait_runtime >>= 1;
>         p_to->wait_runtime += p->wait_runtime;
>     }
>
> the above is the basic expression of: "charge a positive bank balance".
>
> [..]
>
> [note, due to the nanoseconds unit there's no rounding loss to worry
> about.]

Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?

> Ingo

Rogan
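Rogan's point can be checked with a small user-space sketch. struct task below is a hypothetical stand-in for the single task_struct field involved; yield_half_quoted() follows the two quoted statements, while yield_half_conserving() is an illustrative variant that is not from the thread. With an odd balance the quoted form drops one nanosecond from the total per transfer, which may well be negligible at that unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the one task_struct field used here. */
struct task {
    long long wait_runtime; /* nanoseconds */
};

/* Transfer as in the quoted snippet: halve the giver's balance, then
 * credit the already-halved value to the receiver. */
void yield_half_quoted(struct task *p, struct task *p_to)
{
    assert(p != NULL);
    if (p_to && p->wait_runtime > 0) {
        p->wait_runtime >>= 1;
        p_to->wait_runtime += p->wait_runtime;
    }
}

/* Illustrative variant that conserves the total: compute the gift
 * once and subtract exactly that amount from the giver. */
void yield_half_conserving(struct task *p, struct task *p_to)
{
    assert(p != NULL);
    if (p_to && p->wait_runtime > 0) {
        long long gift = p->wait_runtime >> 1;

        p->wait_runtime -= gift;
        p_to->wait_runtime += gift;
    }
}
```

Starting from 5 ns and 0 ns, the quoted form ends at 2 and 2 (total 4), and the conserving variant ends at 3 and 2 (total 5).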
Re: [REPORT] cfs-v4 vs sd-0.44
* Gene Heskett <[EMAIL PROTECTED]> wrote:

> > Gene has done some testing under CFS with X reniced to +10 and the
> > desktop still worked smoothly for him.
>
> As a data point here, and probably nothing to do with X, but I did
> manage to lock it up, solid, reset button time tonight, by wanting
> 'smart' to get done with an update session after amanda had started.
> I took both smart processes I could see in htop all the way to -19,
> but when it was about done about 3 minutes later, everything came to
> an instant, frozen, reset button required lockup. I should have
> stopped at -17 I guess. :(

yeah, i guess this has little to do with X. I think in your scenario it might have been smarter to either stop, or to renice the workloads that took away CPU power from others to _positive_ nice levels. Negative nice levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the SysRq-N [SysRq-Nice] implementation in my tree to not only change real-time tasks to SCHED_OTHER, but to also renice negative nice levels back to 0 - this will show up in -v6. That way you'd only have had to hit SysRq-N to get the system out of the wedge.)

    Ingo
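The SysRq-N behaviour described in this message can be sketched as a pure function over a task description. This is a user-space illustration only: struct task_info and sysrq_nice_normalize() are hypothetical names, and the actual change lives in Ingo's tree and is not shown in the thread.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical summary of the two scheduling attributes involved. */
struct task_info {
    int policy; /* SCHED_OTHER, SCHED_FIFO, SCHED_RR, ... */
    int nice;   /* -20 .. 19 */
};

/*
 * Apply the policy described in the mail: real-time tasks drop back
 * to SCHED_OTHER, and negative nice levels are reset to 0.
 * Positive nice levels are left alone.
 */
void sysrq_nice_normalize(struct task_info *t)
{
    assert(t != NULL);
    if (t->policy == SCHED_FIFO || t->policy == SCHED_RR)
        t->policy = SCHED_OTHER;
    if (t->nice < 0)
        t->nice = 0;
}
```

Leaving positive nice levels untouched is a deliberate choice in this sketch, since the message only mentions negative levels.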
Re: [PATCH 10/10] mm: per device dirty threshold
On Tue, 2007-04-24 at 12:58 +1000, Neil Brown wrote: > On Friday April 20, [EMAIL PROTECTED] wrote: > > Scale writeback cache per backing device, proportional to its writeout > > speed. > > So it works like this: > > We account for writeout in full pages. > When a page has the Writeback flag cleared, we account that as a > successfully retired write for the relevant bdi. > By using floating averages we keep track of how many writes each bdi > has retired 'recently' where the unit of time in which we understand > 'recently' is a single page written. That is actually that period I keep referring to. So recently is the last 'period' number of writeout completions. > We keep a floating average for each bdi, and a floating average for > the total writeouts (that 'average' is, of course, 1.) 1 in the sense of unity, yes :-) > Using these numbers we can calculate what faction of 'recently' > retired writes were retired by each bdi (get_writeout_scale). > > Multiplying this fraction by the system-wide number of pages that are > allowed to be dirty before write-throttling, we get the number of > pages that the bdi can have dirty before write-throttling the bdi. > > I note that the same fraction is *not* applied to background_thresh. > Should it be? I guess not - there would be interesting starting > transients, as a bdi which had done no writeout would not be allowed > any dirty pages, so background writeout would start immediately, > which isn't what you want... or is it? This is something I have not been able to come to a conclusive answer yet,... > For each bdi we also track the number of (dirty, writeback, unstable) > pages and do not allow this to exceed the limit set for this bdi. > > The calculations involving 'reserve' in get_dirty_limits are a little > confusing. 
It looks like you calculating how much total head-room > there is for the bdi (pages that the system can still dirty - pages > this bdi has dirty) and making sure the number returned in pbdi_dirty > doesn't allow more than that to be used. Yes, it limits the earned share of the total dirty limit to the possible share, ensuring that the total dirty limit is never exceeded. This is especially relevant when the proportions change faster than the pages get written out, ie. when the period << total dirty limit. > This is probably a > reasonable thing to do but it doesn't feel like the right place. I > think get_dirty_limits should return the raw threshold, and > balance_dirty_pages should do both tests - the bdi-local test and the > system-wide test. Ok, that makes sense I guess. > Currently you have a rather odd situation where > + if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) > + break; > might included numbers obtained with bdi_stat_sum being compared with > numbers obtained with bdi_stat. Yes, I was aware of that. The bdi_thresh is based on bdi_stat() numbers, whereas the others could be bdi_stat_sum(). I think this is ok, since the threshold is a 'guess' anyway, we just _need_ to ensure we do not get trapped by writeouts not arriving (due to getting stuck in the per cpu deltas). -- I have all this commented in the new version. > With these patches, the VM still (I think) assumes that each BDI has > a reasonable queue limit, so that writeback_inodes will block on a > full queue. If a BDI has a very large queue, balance_dirty_pages > will simply turn lots of DIRTY pages into WRITEBACK pages and then > think "We've done our duty" without actually blocking at all. It will block once we exceed the total number of dirty pages allowed for that BDI. But yes, this does not take away the need for queue limits. This work was primarily aimed at allowing multiple queues to not interfere as much, so they all can make progress and not get starved. 
> With the extra accounting that we now have, I would like to see > balance_dirty_pages dirty pages wait until RECLAIMABLE+WRITEBACK is > actually less than 'threshold'. This would probably mean that we > would need to support per-bdi background_writeout to smooth things > out. Maybe that it fodder for another patch-set. Indeed, I still have to wrap my mind around the background thing. Your input is appreciated. > You set: > + vm_cycle_shift = 1 + ilog2(vm_total_pages); > > Can you explain that? You found the one random knob I hid :-) > My experience is that scaling dirty limits > with main memory isn't what we really want. When you get machines > with very large memory, the amount that you want to be dirty is more > a function of the speed of your IO devices, rather than the amount > of memory, otherwise you can sometimes see large filesystem lags > ('sync' taking minutes?) > > I wonder if it makes sense to try to limit the dirty data for a bdi > to the amount that it can write out in some period of time - maybe 3 > seconds. Probably configurable. You seem to
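The per-device limit discussed in this message can be sketched with plain integer arithmetic. Everything here is an assumption made for illustration: the function names are hypothetical, the floating averages are replaced by two plain counters, and the head-room rule is one reading of the 'reserve' discussion (pages the system may still dirty plus pages this bdi already has dirty), not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Share of the system-wide dirty threshold earned by one backing
 * device: dirty_thresh scaled by the fraction of recently retired
 * writeouts that this device completed.
 */
uint64_t bdi_dirty_share(uint64_t dirty_thresh, uint64_t bdi_writeouts,
                         uint64_t total_writeouts)
{
    assert(bdi_writeouts <= total_writeouts);
    if (total_writeouts == 0)
        return 0; /* no writeout history yet */
    return dirty_thresh * bdi_writeouts / total_writeouts;
}

/*
 * Clip the earned share to the possible share, so that the total
 * dirty limit is never exceeded: head-room left system-wide plus the
 * pages this device already has dirty.
 */
uint64_t bdi_dirty_limit(uint64_t share, uint64_t dirty_thresh,
                         uint64_t total_dirty, uint64_t bdi_dirty)
{
    uint64_t reserve = dirty_thresh > total_dirty ?
                       dirty_thresh - total_dirty : 0;

    reserve += bdi_dirty;
    return share < reserve ? share : reserve;
}
```

With a threshold of 1000 pages and a device that retired a quarter of recent writeouts, the earned share is 250 pages; if 900 pages are already dirty system-wide and 50 of them belong to this device, the limit is clipped to 150.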
Re: [patch 1/4] Ignore stolen time in the softlockup watchdog
On Mon, 23 Apr 2007 23:58:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Tue, 27 Mar 2007 14:49:20 -0700 Jeremy Fitzhardinge <[EMAIL PROTECTED]> > > wrote: > > > > > >> The softlockup watchdog is currently a nuisance in a virtual machine, > >> since the whole system could have the CPU stolen from it for a long > >> period of time. While it would be unlikely for a guest domain to be > >> denied timer interrupts for over 10s, it could happen and any softlockup > >> message would be completely spurious. > >> > >> Earlier I proposed that sched_clock() return time in unstolen > >> nanoseconds, which is how Xen and VMI currently implement it. If the > >> softlockup watchdog uses sched_clock() to measure time, it would > >> automatically ignore stolen time, and therefore only report when the > >> guest itself locked up. When running native, sched_clock() returns > >> real-time nanoseconds, so the behaviour would be unchanged. > >> > >> Note that sched_clock() used this way is inherently per-cpu, so this > >> patch makes sure that the per-processor watchdog thread initialized > >> its own timestamp. > >> > > > > This patch > > (ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/broken-out/ignore-stolen-time-in-the-softlockup-watchdog.patch) > > causes six failures in the locking self-tests, which I must say is rather > > clever of it. > > > > Interesting. I'll say. > Which variation of sched_clock do you have in your tree at > the moment? Andi's, plus the below fix. Sigh. I thought I was only two more bugs away from a release, then... 
[18014389.347124] BUG: unable to handle kernel paging request at virtual address 6b6b7193 [18014389.347142] printing eip: [18014389.347149] c029a80c [18014389.347156] *pde = [18014389.347166] Oops: [#1] [18014389.347174] Modules linked in: i915 drm ipw2200 sonypi ipv6 autofs4 hidp l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables cpufreq_ondemand video sbs button battery asus_acpi ac nvram ohci1394 ieee1394 ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm sr_mod cdrom snd_timer ieee80211 i2c_i801 piix ieee80211_crypt i2c_core generic snd soundcore snd_page_alloc ext3 jbd ide_disk ide_core [18014389.347520] CPU:0 [18014389.347521] EIP:0060:[]Tainted: G D VLI [18014389.347522] EFLAGS: 00010296 (2.6.21-rc7-mm1 #35) [18014389.347547] EIP is at input_release_device+0x8/0x4e [18014389.347555] eax: c99709a8 ebx: 6b6b6b6b ecx: 0286 edx: [18014389.347563] esi: 6b6b6b6b edi: c99709cc ebp: c21e3d40 esp: c21e3d38 [18014389.347571] ds: 007b es: 007b fs: 00d8 gs: ss: 0068 [18014389.347580] Process khubd (pid: 159, ti=c21e2000 task=c20a62f0 task.ti=c21e2000) [18014389.347588] Stack: 6b6b6b6b c99709a8 c21e3d60 c029b489 c2014ec8 c9182000 c96b167c c9970954 [18014389.347655]c9970954 c99709cc c21e3d80 c029d401 c9977a6c c96b1000 c21e3d90 c9970954 [18014389.347708]c99709a8 c9164000 c21e3d90 c029d4b5 c96b1000 c9970564 c21e3db0 c029c50b [18014389.347771] Call Trace: [18014389.347792] [] input_close_device+0x13/0x51 [18014389.347810] [] mousedev_destroy+0x29/0x7e [18014389.347827] [] mousedev_disconnect+0x5f/0x63 [18014389.347842] [] input_unregister_device+0x6a/0x100 [18014389.347858] [] hidinput_disconnect+0x24/0x41 [18014389.347874] [] hid_disconnect+0x79/0xc9 [18014389.347889] [] usb_unbind_interface+0x47/0x8f [18014389.347916] [] __device_release_driver+0x74/0x90 [18014389.347933] [] 
device_release_driver+0x37/0x4e [18014389.347957] [] bus_remove_device+0x73/0x82 [18014389.347977] [] device_del+0x214/0x28c [18014389.348132] [] usb_disable_device+0x62/0xc2 [18014389.348148] [] usb_disconnect+0x99/0x126 [18014389.348163] [] hub_thread+0x3a5/0xb07 [18014389.348178] [] kthread+0x6e/0x79 [18014389.348194] [] kernel_thread_helper+0x7/0x10 [18014389.348210] === [18014389.348218] INFO: lockdep is turned off. [18014389.348224] Code: 5b 5d c3 55 b9 f0 ff ff ff 8b 50 0c 89 e5 83 ba 28 06 00 00 00 75 08 89 82 28 06 00 00 31 c9 5d 89 c8 c3 55 89 e5 56 53 8b 70 0c <39> 86 28 06 00 00 75 3a 8b 9e e4 08 00 00 c7 86 28 06 00 00 00 I dunno. I'll keep plugging for another couple hours then I'll shove out what I have as a -mm snapshot whatsit. Things are just ridiculous. I'm thinking of having a hard-disk crash and accidentally losing everything. From: Andrew Morton <[EMAIL PROTECTED]> WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to .init.text:sc_cpu_event from .data between 'sc_cpu_notifier' (at offset 0x2110) and 'mcelog' Use hotcpu_notifier(). This takes care of making sure that the unused code disappears from vmlinux if !CONFIG_HOTPLUG_CPU, too.
How do you send a reply to an email you have deleted.
How do you send a reply to an email you have deleted? -- [EMAIL PROTECTED] -- http://www.fastmail.fm - I mean, what is it about a decent email service?
Re: [REPORT] cfs-v4 vs sd-0.44
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Gene Heskett <[EMAIL PROTECTED]> wrote:
>> > Gene has done some testing under CFS with X reniced to +10 and the
>> > desktop still worked smoothly for him.
>>
>> As a data point here, and probably nothing to do with X, but I did
>> manage to lock it up, solid, reset button time tonight, by wanting
>> 'smart' to get done with an update session after amanda had started.
>> I took both smart processes I could see in htop all the way to -19,
>> but when it was about done about 3 minutes later, everything came to
>> an instant, frozen, reset button required lockup. I should have
>> stopped at -17 I guess. :(
>
>yeah, i guess this has little to do with X. I think in your scenario it
>might have been smarter to either stop, or to renice the workloads that
>took away CPU power from others to _positive_ nice levels. Negative nice
>levels can indeed be dangerous.
>
>(Btw., to protect against such mishaps in the future i have changed the
>SysRq-N [SysRq-Nice] implementation in my tree to not only change
>real-time tasks to SCHED_OTHER, but to also renice negative nice levels
>back to 0 - this will show up in -v6. That way you'd only have had to
>hit SysRq-N to get the system out of the wedge.)
>
> Ingo

That sounds handy, particularly with idiots like me at the wheel...

--
Cheers, Gene
"There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
When a Banker jumps out of a window, jump after him--that's where the money is.
    -- Robespierre
Re: [REPORT] cfs-v4 vs sd-0.44
* Gene Heskett <[EMAIL PROTECTED]> wrote:

> > (Btw., to protect against such mishaps in the future i have changed
> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
> > change real-time tasks to SCHED_OTHER, but to also renice negative
> > nice levels back to 0 - this will show up in -v6. That way you'd
> > only have had to hit SysRq-N to get the system out of the wedge.)
>
> That sounds handy, particularly with idiots like me at the wheel...

by that standard i guess we tinkerers are all idiots ;)

    Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
On Tue, 24 Apr 2007, Ingo Molnar wrote:

> * Gene Heskett <[EMAIL PROTECTED]> wrote:
>
>>> Gene has done some testing under CFS with X reniced to +10 and the
>>> desktop still worked smoothly for him.
>>
>> As a data point here, and probably nothing to do with X, but I did
>> manage to lock it up, solid, reset button time tonight, by wanting
>> 'smart' to get done with an update session after amanda had started.
>> I took both smart processes I could see in htop all the way to -19,
>> but when it was about done about 3 minutes later, everything came to
>> an instant, frozen, reset button required lockup. I should have
>> stopped at -17 I guess. :(
>
> yeah, i guess this has little to do with X. I think in your scenario it
> might have been smarter to either stop, or to renice the workloads that
> took away CPU power from others to _positive_ nice levels. Negative nice
> levels can indeed be dangerous.
>
> (Btw., to protect against such mishaps in the future i have changed the
> SysRq-N [SysRq-Nice] implementation in my tree to not only change
> real-time tasks to SCHED_OTHER, but to also renice negative nice levels
> back to 0 - this will show up in -v6. That way you'd only have had to
> hit SysRq-N to get the system out of the wedge.)

if you are trying to unwedge a system it may be a good idea to renice all tasks to 0, it could be that a task at +19 is holding a lock that something else is waiting for.

David Lang
Re: [PATCH 1/2] x86_64: Reflect the relocatability of the kernel in the ELF header.
Vivek Goyal <[EMAIL PROTECTED]> writes:

> On Sun, Apr 22, 2007 at 11:12:13PM -0600, Eric W. Biederman wrote:
>>
>> Currently because vmlinux does not reflect that the kernel is relocatable
>> we still have to support CONFIG_PHYSICAL_START. So this patch adds a small
>> c program to do what we cannot do with a linker script, set the elf header
>> type to ET_DYN.
>>
>> This should remove the last obstacle to removing CONFIG_PHYSICAL_START
>> on x86_64.
>>
>> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
>
> [Dropping fastboot mailing list from CC as kexec mailing list is new list
> for this discussion]
>
> [..]
>> +void file_open(const char *name)
>> +{
>> +    if ((fd = open(name, O_RDWR, 0)) < 0)
>> +        die("Unable to open `%s': %m", name);
>> +}
>> +
>> +static void mketrel(void)
>> +{
>> +    unsigned char e_type[2];
>> +    if (read(fd, &e_ident, sizeof(e_ident)) != sizeof(e_ident))
>> +        die("Cannot read ELF header: %s\n", strerror(errno));
>> +
>> +    if (memcmp(e_ident, ELFMAG, 4) != 0)
>> +        die("No ELF magic\n");
>> +
>> +    if ((e_ident[EI_CLASS] != ELFCLASS64) &&
>> +        (e_ident[EI_CLASS] != ELFCLASS32))
>> +        die("Unrecognized ELF class: %x\n", e_ident[EI_CLASS]);
>> +
>> +    if ((e_ident[EI_DATA] != ELFDATA2LSB) &&
>> +        (e_ident[EI_DATA] != ELFDATA2MSB))
>> +        die("Unrecognized ELF data encoding: %x\n", e_ident[EI_DATA]);
>> +
>> +    if (e_ident[EI_VERSION] != EV_CURRENT)
>> +        die("Unknown ELF version: %d\n", e_ident[EI_VERSION]);
>> +
>> +    if (e_ident[EI_DATA] == ELFDATA2LSB) {
>> +        e_type[0] = ET_REL & 0xff;
>> +        e_type[1] = ET_REL >> 8;
>> +    } else {
>> +        e_type[1] = ET_REL & 0xff;
>> +        e_type[0] = ET_REL >> 8;
>> +    }
>
> Hi Eric,
>
> Should this be ET_REL or ET_DYN? kexec refuses to load this vmlinux
> as it does not find it to be executable type.

Doh. It should be ET_DYN. I had relocatable much too much on the brain, and so I stuffed in the wrong type.
> I am not well versed with various conventions but if I go through "Executable
> and Linking Format" document, this is what it says about various file types.
>
> • A relocatable file holds code and data suitable for linking with other
>   object files to create an executable or a shared object file.
>
> • An executable file holds a program suitable for execution.
>
> • A shared object file holds code and data suitable for linking in two
>   contexts. First, the link editor may process it with other relocatable and
>   shared object files to create another object file. Second, the dynamic
>   linker combines it with an executable file and other shared objects
>   to create a process image.
>
> So above does not seem to fit in the ET_REL type. We can't relink this
> vmlinux? And it does not seem to fit in ET_DYN definition too. We are
> not relinking this vmlinux with another executable or other relocatable
> files.
>
> I remember once you mentioned the term dynamic executable which can be
> loaded at a non-compiled address and let run without requiring any
> relocation processing. This vmlinux will fall in that category but can't
> relate it to standard elf file definitions.

Sorry about that. ET_DYN without a PT_DYNAMIC segment, without a PT_INTERP segment, and with a valid entry point is exactly that.

Loaders never perform relocation processing on an ET_DYN executable, but they are allowed to shift all of the addresses by a single delta so long as all of the alignment restrictions are honored.

Relocation processing, when it happens, comes from the dynamic linker, which is set in PT_INTERP, and the dynamic linker looks at PT_DYNAMIC to figure out what relocations are available for processing.

The basic issue is that ld doesn't really comprehend what we are doing, since we are building a position independent executable in a way that the normal tools don't allow, so we have to poke the header.
If we had compiled with -fPIC we could have specified -pie or --pic-executable to ld and it would have done the right thing. But as it is, our executable only changes physical addresses and not virtual addresses, something completely foreign to ld. Eric
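The header poke described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from the patch: the helper name set_elf_type_dyn and the in-memory buffer interface are hypothetical, and the constants come from the glibc <elf.h> header. It relies on e_type being the two bytes that directly follow the 16-byte e_ident array in both ELF classes, and stores ET_DYN in the byte order that e_ident announces.

```c
#include <assert.h>
#include <elf.h>      /* ELFMAG, SELFMAG, EI_*, ET_DYN (glibc/Linux header) */
#include <stddef.h>
#include <string.h>

/* e_type directly follows the 16-byte e_ident array in both the
 * 32-bit and the 64-bit ELF header. */
#define E_TYPE_OFFSET EI_NIDENT

/*
 * Hypothetical helper, not taken from the patch: rewrite e_type to ET_DYN
 * in an in-memory copy of an ELF header. Returns 0 on success, -1 if the
 * buffer is too short or does not look like an ELF header we understand.
 */
static int set_elf_type_dyn(unsigned char *hdr, size_t len)
{
	if (len < E_TYPE_OFFSET + 2)
		return -1;
	if (memcmp(hdr, ELFMAG, SELFMAG) != 0)
		return -1;
	if (hdr[EI_CLASS] != ELFCLASS32 && hdr[EI_CLASS] != ELFCLASS64)
		return -1;
	if (hdr[EI_VERSION] != EV_CURRENT)
		return -1;

	switch (hdr[EI_DATA]) {
	case ELFDATA2LSB:	/* little endian: low byte first */
		hdr[E_TYPE_OFFSET]     = ET_DYN & 0xff;
		hdr[E_TYPE_OFFSET + 1] = ET_DYN >> 8;
		break;
	case ELFDATA2MSB:	/* big endian: high byte first */
		hdr[E_TYPE_OFFSET]     = ET_DYN >> 8;
		hdr[E_TYPE_OFFSET + 1] = ET_DYN & 0xff;
		break;
	default:
		return -1;
	}
	return 0;
}
```

A caller would read the first 18 bytes of vmlinux into a buffer, pass it to the helper, and write the buffer back at offset 0; the file I/O is left out here to keep the sketch short.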
Re: [REPORT] cfs-v4 vs sd-0.44
* David Lang <[EMAIL PROTECTED]> wrote: > > (Btw., to protect against such mishaps in the future i have changed > > the SysRq-N [SysRq-Nice] implementation in my tree to not only > > change real-time tasks to SCHED_OTHER, but to also renice negative > > nice levels back to 0 - this will show up in -v6. That way you'd > > only have had to hit SysRq-N to get the system out of the wedge.) > > if you are trying to unwedge a system it may be a good idea to renice > all tasks to 0, it could be that a task at +19 is holding a lock that > something else is waiting for. Yeah, that's possible too, but +19 tasks are getting a small but guaranteed share of the CPU so eventually it ought to release it. It's still a possibility, but i think i'll wait for a specific incident to happen first, and then react to that incident :-) Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > yeah, i guess this has little to do with X. I think in your scenario > it might have been smarter to either stop, or to renice the workloads > that took away CPU power from others to _positive_ nice levels. > Negative nice levels can indeed be dangerous. btw., was X itself at nice 0 or nice -10 when the lockup happened? Ingo
Re: [REPORT] cfs-v4 vs sd-0.44
* Rogan Dawes <[EMAIL PROTECTED]> wrote: > >if (p_to && p->wait_runtime > 0) { > >p->wait_runtime >>= 1; > >p_to->wait_runtime += p->wait_runtime; > >} > > > >the above is the basic expression of: "charge a positive bank balance". > > > > [..] > > > [note, due to the nanoseconds unit there's no rounding loss to worry > > about.] > > Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss? yes. But note that we'll only truly have to worry about that when we'll have context-switching performance in that range - currently it's at least 2-3 orders of magnitude above that. Microseconds seemed to me to be too coarse already, that's why i picked nanoseconds and 64-bit arithmetic for CFS. Ingo
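The rounding question above is easy to check with a small stand-alone sketch. The struct and the function name charge_half are hypothetical stand-ins, not scheduler code; only the three lines inside the if are taken from the quoted snippet. The function additionally reports how many nanoseconds disappear, which is at most one per transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scheduler's per-task accounting. */
struct task {
	int64_t wait_runtime;	/* nanoseconds */
};

/*
 * Halve p's positive balance and credit the same amount to p_to, as in
 * the quoted snippet. Returns the nanoseconds lost to integer rounding,
 * which is 1 for an odd balance and 0 otherwise.
 */
static int64_t charge_half(struct task *p, struct task *p_to)
{
	int64_t before, lost = 0;

	if (p_to && p->wait_runtime > 0) {
		before = p->wait_runtime;
		p->wait_runtime >>= 1;
		p_to->wait_runtime += p->wait_runtime;
		/* original balance minus the two equal shares */
		lost = before - 2 * p->wait_runtime;
	}
	return lost;
}
```

With a balance of 5 ns, both tasks end up with 2 ns from the transfer and 1 ns is dropped, which matches the point made in the reply: the loss exists but is bounded by a single nanosecond per operation.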
Re: [PATCH]Fix parsing kernelcore boot option for ia64
Mel-san. I tested your patch (Thanks!). It worked. But.. > In my understanding, why ia64 doesn't use early_param() macro for mem= at el. > is that > it has to use mem= option at efi handling which is called before > parse_early_param(). > > Current ia64's boot path is > setup_arch() > -> efi handling -> parse_early_param() -> numa handling -> pgdat/zone init > > kernelcore= option is just used at pgdat/zone initialization. (no arch > dependent part...) > > So I think just adding > == > early_param("kernelcore",cmpdline_parse_kernelcore) > == > to ia64 is ok. Then, it can be common code. How is this patch? I confirmed this can work well too. When "kernelcore" boot option is specified, kernel can't boot up on ia64. It is cause of eternal loop. In addition, its code can be common code. This is fix for it. I tested this patch on my ia64 box. Signed-off-by: Yasunori Goto <[EMAIL PROTECTED]> - arch/i386/kernel/setup.c |1 - arch/ia64/kernel/efi.c |2 -- arch/powerpc/kernel/prom.c |1 - arch/ppc/mm/init.c |2 -- arch/x86_64/kernel/e820.c |1 - include/linux/mm.h |1 - mm/page_alloc.c|3 +++ 7 files changed, 3 insertions(+), 8 deletions(-) Index: kernelcore/arch/ia64/kernel/efi.c === --- kernelcore.orig/arch/ia64/kernel/efi.c 2007-04-24 15:09:37.0 +0900 +++ kernelcore/arch/ia64/kernel/efi.c 2007-04-24 15:25:22.0 +0900 @@ -423,8 +423,6 @@ efi_init (void) mem_limit = memparse(cp + 4, ); } else if (memcmp(cp, "max_addr=", 9) == 0) { max_addr = GRANULEROUNDDOWN(memparse(cp + 9, )); - } else if (memcmp(cp, "kernelcore=",11) == 0) { - cmdline_parse_kernelcore(cp+11); } else if (memcmp(cp, "min_addr=", 9) == 0) { min_addr = GRANULEROUNDDOWN(memparse(cp + 9, )); } else { Index: kernelcore/arch/i386/kernel/setup.c === --- kernelcore.orig/arch/i386/kernel/setup.c2007-04-24 15:29:20.0 +0900 +++ kernelcore/arch/i386/kernel/setup.c 2007-04-24 15:29:39.0 +0900 @@ -195,7 +195,6 @@ static int __init parse_mem(char *arg) return 0; } early_param("mem", parse_mem); -early_param("kernelcore", 
cmdline_parse_kernelcore); #ifdef CONFIG_PROC_VMCORE /* elfcorehdr= specifies the location of elf core header Index: kernelcore/arch/powerpc/kernel/prom.c === --- kernelcore.orig/arch/powerpc/kernel/prom.c 2007-04-24 15:04:47.0 +0900 +++ kernelcore/arch/powerpc/kernel/prom.c 2007-04-24 15:30:25.0 +0900 @@ -431,7 +431,6 @@ static int __init early_parse_mem(char * return 0; } early_param("mem", early_parse_mem); -early_param("kernelcore", cmdline_parse_kernelcore); /* * The device tree may be allocated below our memory limit, or inside the Index: kernelcore/arch/ppc/mm/init.c === --- kernelcore.orig/arch/ppc/mm/init.c 2007-04-24 15:04:47.0 +0900 +++ kernelcore/arch/ppc/mm/init.c 2007-04-24 15:30:56.0 +0900 @@ -214,8 +214,6 @@ void MMU_setup(void) } } -early_param("kernelcore", cmdline_parse_kernelcore); - /* * MMU_init sets up the basic memory mappings for the kernel, * including both RAM and possibly some I/O regions, Index: kernelcore/arch/x86_64/kernel/e820.c === --- kernelcore.orig/arch/x86_64/kernel/e820.c 2007-04-24 15:04:47.0 +0900 +++ kernelcore/arch/x86_64/kernel/e820.c2007-04-24 15:34:02.0 +0900 @@ -604,7 +604,6 @@ static int __init parse_memopt(char *p) return 0; } early_param("mem", parse_memopt); -early_param("kernelcore", cmdline_parse_kernelcore); static int userdef __initdata; Index: kernelcore/include/linux/mm.h === --- kernelcore.orig/include/linux/mm.h 2007-04-24 15:09:37.0 +0900 +++ kernelcore/include/linux/mm.h 2007-04-24 15:35:52.0 +0900 @@ -1051,7 +1051,6 @@ extern unsigned long find_max_pfn_with_a extern void free_bootmem_with_active_regions(int nid, unsigned long max_low_pfn); extern void sparse_memory_present_with_active_regions(int nid); -extern int cmdline_parse_kernelcore(char *p); #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID extern int early_pfn_to_nid(unsigned long pfn); #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ Index: kernelcore/mm/page_alloc.c === --- kernelcore.orig/mm/page_alloc.c 2007-04-24 15:09:37.0 +0900 +++ 
kernelcore/mm/page_alloc.c
Re: [REPORT] cfs-v4 vs sd-0.44
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > [...] That way you'd only have had to hit SysRq-N to get the system > out of the wedge.) small correction: Alt-SysRq-N. Ingo
[PATCH] i802.11: fixed memory leak on multicasts
Hi, socket buffers were not always freed when receiving multicasts Bye, -- Markus Pietrek Lead Software Engineer Phone: +49-7667-908-501, Fax: +49-7667-908-200 mailto:[EMAIL PROTECTED] FS Forth-Systeme GmbH "A Digi International Company" Kueferstr. 8, 79206 Breisach, Germany Tax: 07008/12000 / VAT: DE142208834 / Reg. Amtsgericht Freiburg HRB 290212 Directors: Klaus Flesch, Subramanian Krishnan, Dieter Vesper http://www.digi.com Index: net/ieee80211/ieee80211_rx.c === RCS file: /data/vcs/cvs/fsforth_products/LxNETES/linux/net/ieee80211/ieee80211_rx.c,v retrieving revision 1.5 retrieving revision 1.6 diff -c -r1.5 -r1.6 *** net/ieee80211/ieee80211_rx.c13 Apr 2007 12:39:38 - 1.5 --- net/ieee80211/ieee80211_rx.c23 Apr 2007 15:51:28 - 1.6 *** *** 860,868 break; } ! if (is_packet_for_us) if (!ieee80211_rx(ieee, skb, stats)) dev_kfree_skb_irq(skb); return; drop_free: --- 860,871 break; } ! if (is_packet_for_us) { if (!ieee80211_rx(ieee, skb, stats)) dev_kfree_skb_irq(skb); + } else + dev_kfree_skb_irq(skb); + return; drop_free:
cfs works fine for me
Hello, I have tried the cfs patches with 2.6.20.7 in the last days. I am using KDE 3.5.6, gentoo unstable and have a dual core AMD64 system with 1GB ram and a nvidia card (using the closed source drivers, yes I suck, but I love playing 3d games once in a while). I don't have interactivity problems with plain kernel.org kernels (except when swapping a lot, swapping really sucks). My system works well and is stable. With the cfs patches, my system continues to work well. I have not seen any regressions, desktop is snappy, emerge'ing stuff (niced to +19) does not hurt, and unreal tournament 2004 is as fast (or slow, depends on the situation) as always. It even looks like FPS under heavy stress (like onslaught torlan when lots of bots and me are fighting at a powernode) don't go down as low as with the mainline scheduler. Not a big difference, but it is there (20-25 with plain kernel.org kernel in extreme situations compared to >30 with the cfs patches). Maybe I did not hit the worst case, playing is a little bit restricted at the moment - my wrist and elbow hate me, but it looks promising. Apart from the worst case scenarios, FPS are more or less the same. My usage consisted of surfing the web with konqueror, watching videos with xine and mplayer, using kmail (with tens of thousands of mails in different folders), looking at pictures with kuickshow, installing XFCE, assorted updates, typing lots and lots of stuff in kate and web forums, listening to mp3/ogg with amarok, playing pysol/kpat/lgeneral/wesnoth/ut2004/freecol, a lot of that in parallel (not ut2004... I don't want to hurt my precious fps...). Again, my system worked fine with the 'normal' scheduler, from the stuff I read in the lkml archives I must be some special kind of guy, so there was no improvement on the 'feels snappy or not' front, but there are also no regressions. So from my point of view, everything is fine with cfs and I would not mind having it as default scheduler. 
If you want specs of my hardware, my kernel config or any other information, just send me an email. I am not subscribed to lkml, nor can I read any of its archives in the next couple of days, which is one reason why I am not replying to one of the existing threads (I don't even know if there are some at the moment), so in case of an answer cc'ing me would be nice. Glück Auf Volker
[REPORT] cfs-v5 vs sd-0.46
Hi list, with cfs-v5 finally booting on my machine I have run my daily numbercrunching jobs on both cfs-v5 and sd-0.46, 2.6.21-v7 on top of a stock openSUSE 10.2 (X86_64). Config for both kernel is the same except for the X boost option in cfs-v5 which on my system didn't work (X still was @ -19; I understand this will be fixed in -v6). HZ is 250 in both. System is a Dell XPS M1710, Intel Core2 2.33GHz, 4GB, NVIDIA GeForce Go 7950 GTX with proprietary driver 1.0-9755 I'm running three single threaded perl scripts that do double precision floating point math with little i/o after initially loading the data. Both cfs and sd showed very similar behavior when monitored in top. I'll show more or less representative excerpt from a 10 minutes log, delay 3sec. sd-0.46 top - 00:14:24 up 1:17, 9 users, load average: 4.79, 4.95, 4.80 Tasks: 3 total, 3 running, 0 sleeping, 0 stopped, 0 zombie Cpu(s): 99.8%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.2%hi, 0.0%si, 0.0%st Mem: 3348628k total, 1648560k used, 1700068k free,64392k buffers Swap: 2097144k total,0k used, 2097144k free, 828204k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 6671 mgd 33 0 95508 22m 3652 R 100 0.7 44:28.11 perl 6669 mgd 31 0 95176 22m 3652 R 50 0.7 43:50.02 perl 6674 mgd 31 0 95368 22m 3652 R 50 0.7 47:55.29 perl cfs-v5 top - 08:07:50 up 21 min, 9 users, load average: 4.13, 4.16, 3.23 Tasks: 3 total, 3 running, 0 sleeping, 0 stopped, 0 zombie Cpu(s): 99.5%us, 0.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Mem: 3348624k total, 1193500k used, 2155124k free,32516k buffers Swap: 2097144k total,0k used, 2097144k free, 545568k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 6357 mgd 20 0 92024 19m 3652 R 100 0.6 8:54.21 perl 6356 mgd 20 0 91652 18m 3652 R 50 0.6 10:35.52 perl 6359 mgd 20 0 91700 18m 3652 R 50 0.6 8:47.32 perl What did surprise me is that cpu utilization had been spread 100/50/50 (round robin) most of the time. I did expect 66/66/66 or so. 
What I also don't understand is the difference in load average, sd constantly had higher values, the above figures are representative for the whole log. I don't know which is better though. Here are excerpts from a concurrently run vmstat 3 200: sd-0.46 procs ---memory-- ---swap-- -io -system-- cpu r b swpd free buff cache si sobibo in cs us sy id wa 5 0 0 1702928 63664 82787600 067 458 1350 100 0 0 0 3 0 0 1702928 63684 82787600 089 468 1362 100 0 0 0 5 0 0 1702680 63696 82787600 0 132 461 1598 99 1 0 0 8 0 0 1702680 63712 82789200 080 465 1180 99 1 0 0 3 0 0 1702712 63732 82788400 067 453 1005 100 0 0 0 4 0 0 1702792 63744 82792000 041 461 1138 100 0 0 0 3 0 0 1702792 63760 82791600 057 456 1073 100 0 0 0 3 0 0 1702808 63776 82792800 0 111 473 1095 100 0 0 0 3 0 0 1702808 63788 82792800 081 461 1092 99 1 0 0 3 0 0 1702188 63808 82792800 0 160 463 1437 99 1 0 0 3 0 0 1702064 63884 82790000 0 229 479 1125 99 0 0 0 4 0 0 1702064 63912 82797200 177 460 1108 100 0 0 0 7 0 0 1702032 63920 82800000 040 463 1068 100 0 0 0 4 0 0 1702048 63928 82800800 068 454 1114 100 0 0 0 11 0 0 1702048 63928 82800800 0 0 458 1001 100 0 0 0 3 0 0 1701500 63960 82802000 0
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
On Tue, 24 Apr 2007 15:00:42 +1000, Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote: > Like anything else, modules should have separated the entrypoints for > > - Initiating a removal request > - Releasing the module > > The former is when the user did "rmmod", can unregister things from subsystems, > etc... (and can fail if the driver decides to refuse removal requests > when it's busy doing things or whatever policy that module wants to > implement). > > The latter is called when all references to the modules have been > dropped, it's a bit like the kref "release" (and could be implemented as > one). That sounds quite similar to the problems we have with kobject refcounting vs. module unloading. The patchset I posted at http://marc.info/?l=linux-kernel&m=117679014404994&w=2 exposes the refcount of the kobject embedded in the module. Maybe the kthread code could use that reference as well?
Re: NonExecutable Bit in 32Bit
On 4/24/07, William Heimbigner <[EMAIL PROTECTED]> wrote: On Tue, 24 Apr 2007, Cestonaro, Thilo (external) wrote: > Hey, > > is it right, that the NX Bit is not used under i386-Arch but > under x86_64-Arch? > When yes, is there a special argument for it not to be used? > > Ciao Thilo I don't think so - some i386 cpus definitely have support for the NX bit. In detail: 1) if your CPU has NX support (some 32bit Xeons do) 2) it is not disabled in the BIOS 3) you see 'nx' in the 'flags' line in /proc/cpuinfo 4) and you have a kernel with the following config options CONFIG_HIGHMEM64G=y CONFIG_HIGHMEM=y CONFIG_X86_PAE=y NX should just work. [snip]
Re: [ofa-general] [PATCH] eHCA: Add "Modify Port" verb
Hi Hal, you are correct, with the current firmware version it will fail later. Christoph R. [EMAIL PROTECTED] wrote on 23.04.2007 18:55:59: > Hi Joachim, > > On Mon, 2007-04-23 at 12:23, Joachim Fenkes wrote: > > Add "Modify Port" verb support to eHCA driver. > > ib_cm needs this to initialize properly. > > I didn't think IB_PORT_SM was allowed (as QP0 is not exposed) or does > this just fail later when it is attempted to be actually set ? > > -- Hal
Re: [PATCH 0/9] Kconfig: cleanup s390 v2.
On Mon, 2007-04-23 at 10:45 -0700, Andrew Morton wrote: > > Andrew: I plan to add patches 1-5 to the for-andrew branch of the > > git390 repository if that is fine with you. The only thing that will > > be missing in the tree is the patch that disables wireless for s390. > > The code does compile but without hardware it is moot to have the > > config options. I'll wait until the git-wireless.patch is upstream. > > Patches 7-9 depend on patches found in -mm. > > > > umm, OK. If it's Ok I think I'll duck it for now: -mm is full. > > Over-full, really: I've been working basically continuously since Friday > getting the current dungpile to compile and boot, and it's still miles away > from that. I understand. I'll wait until -mm is a little bit smaller again. It is just that someday I want to finish with the Kconfig cleanup, it has been sitting on my hard drive for ages now. -- blue skies, Martin. Martin Schwidefsky, Linux on zSeries Development. IBM Deutschland Entwicklung GmbH; Chairman of the Supervisory Board: Johann Weihen; Management: Herbert Kircher; Registered office: Böblingen; Court of registration: Amtsgericht Stuttgart, HRB 243294. "Reality continues to ruin my life." - Calvin.
Re: [REPORT] cfs-v5 vs sd-0.46
* Michael Gerdau <[EMAIL PROTECTED]> wrote: > I'm running three single threaded perl scripts that do double > precision floating point math with little i/o after initially loading > the data. thanks for the testing! > What I also don't understand is the difference in load average, sd > constantly had higher values, the above figures are representative for > the whole log. I don't know which is better though. hm, it's hard from here to tell that. What load average does the vanilla kernel report? I'd take that as a reference. > Here are excerpts from a concurrently run vmstat 3 200: > > sd-0.46 > procs ---memory-- ---swap-- -io -system-- cpu > r b swpd free buff cache si sobibo in cs us sy id wa > 5 0 0 1702928 63664 82787600 067 458 1350 100 0 0 > 0 > 3 0 0 1702928 63684 82787600 089 468 1362 100 0 0 > 0 > 5 0 0 1702680 63696 82787600 0 132 461 1598 99 1 0 0 > 8 0 0 1702680 63712 82789200 080 465 1180 99 1 0 0 > cfs-v5 > procs ---memory-- ---swap-- -io -system-- cpu > r b swpd free buff cache si sobibo in cs us sy id wa > 6 0 0 2157728 31816 54523600 0 103 543 748 100 0 0 > 0 > 4 0 0 2157780 31828 54525600 063 435 752 100 0 0 > 0 > 4 0 0 2157928 31852 54525600 0 105 424 770 100 0 0 > 0 > 4 0 0 2157928 31868 54526800 0 261 457 763 100 0 0 > 0 interesting - CFS has half the context-switch rate of SD. That is probably because on your workload CFS defaults to longer 'timeslices' than SD. You can influence the 'timeslice length' under SD via /proc/sys/kernel/rr_interval (milliseconds units) and under CFS via /proc/sys/kernel/sched_granularity_ns. On CFS the value is not necessarily the timeslice length you will observe - for example in your workload above the granularity is set to 5 msec, but your rescheduling rate is 13 msecs. SD default to a rr_interval value of 8 msecs, which in your workload produces a timeslice length of 6-7 msecs. so to be totally 'fair' and get the same rescheduling 'granularity' you should probably lower CFS's sched_granularity_ns to 2 msecs. 
> Last but not least I'd like to add that at least on my system having X > niced to -19 does result in kind of "erratic" (for lack of a better > word) desktop behavior. I will reevaluate this with -v6 but for now > IMO nicing X to -19 is a regression at least on my machine despite the > claim that cfs doesn't suffer from it. indeed with -19 the rescheduling limit is so high under CFS that it does not throttle X's scheduling rate enough and so it will make CFS behave as badly as other schedulers. I retested this with -10 and it should work better with that. In -v6 i changed the default to -10 too. > PS: Only learning how to test these things I'm happy to get pointed > out the shortcomings of what I tested above. Of course suggestions for > improvements are welcome. your report was perfectly fine and useful. "no visible regressions" is valuable feedback too. [ In fact, such type of feedback is the one i find the easiest to resolve ;-) ] Since you are running number-crunchers you might be able to give performance feedback too: do you have any reliable 'performance metric' available for your number cruncher jobs (ops per minute, runtime, etc.) so that it would be possible to compare number-crunching performance of mainline to SD and to CFS as well? That is, if that value is easy to get and reliable/stable enough to be meaningful. (And it would be nice to also establish some ballpark figure about how much noise there is in any performance metric, so that we can see whether any differences between schedulers are systematic or not.) Ingo
cpufreq default governor
Question: is there some reason that kconfig does not allow for default governors of conservative/ondemand/powersave? I'm not aware of any reason why one of those governors could not be used as default. William Heimbigner [EMAIL PROTECTED]
Re: 2.6.21-rc7: BUG: sleeping function called from invalid context at net/core/sock.c:1523
On Tue, 24 Apr 2007, Herbert Xu wrote: > > Hmm, *sigh*. I guess the patch below fixes the problem, but it is a > > masterpiece in the field of ugliness. And I am not sure whether it is > > completely correct either. Are there any immediate ideas for better > > solution with respect to how struct sock locking works? > Please cc such patches to netdev. Thanks. Hi Herbert, well it's pretty much bluetooth-specific, and bluez-devel was CCed, but OK. > > diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c > > index 71f5cfb..c5c93cd 100644 > > --- a/net/bluetooth/hci_sock.c > > +++ b/net/bluetooth/hci_sock.c > > @@ -656,7 +656,10 @@ static int hci_sock_dev_event(struct notifier_block > > *this, unsigned long event, > >/* Detach sockets from device */ > >read_lock(&hci_sk_list.lock); > >sk_for_each(sk, node, &hci_sk_list.head) { > > - lock_sock(sk); > > + if (in_atomic()) > > + bh_lock_sock(sk); > > + else > > + lock_sock(sk); > > This doesn't do what you think it does. bh_lock_sock can still succeed > even with lock_sock held by someone else. I know, this was precisely the reason why I converted the bh_lock_sock() to lock_sock() here some time ago (as it was racy with l2cap_connect_cfm()). > Does this need to occur immediately when an event occurs? If not I'd > suggest moving this into a workqueue. Will have to check whether this will be processed properly in time when going to suspend. Thanks, -- Jiri Kosina
Re: [patch 1/7] libata: check for AN support
Hello, Kristen Carlson Accardi wrote: > static unsigned int ata_print_id = 1; > @@ -1744,6 +1745,23 @@ int ata_dev_configure(struct ata_device > } > dev->cdb_len = (unsigned int) rc; > > + /* > + * check to see if this ATAPI device supports > + * Asynchronous Notification > + */ > + if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id)) > + { > + /* issue SET feature command to turn this on */ > + rc = ata_dev_set_AN(dev); Please don't store err_mask into int rc. Please store it to a separate err_mask variable and report it when printing error message. > + if (rc) { > + ata_dev_printk(dev, KERN_ERR, > + "unable to set AN\n"); > + rc = -EINVAL; Wouldn't -EIO be more appropriate? > + goto err_out_nosup; > + } > + dev->flags |= ATA_DFLAG_AN; > + } > + Not NACKing. Just notes for future improvements. We need to be more careful here. The ATA/ATAPI world is filled with braindamaged devices and I bet there are devices which advertise that they can do AN but choke when AN is enabled. This should be handled similarly to ACPI failure. Currently ACPI does the following. 1. try once, if fail, record that ACPI failed. return error to trigger retry. 2. try again, if fail again, ignore error if possible (!FROZEN) and turn off ACPI. This fallback mechanism for optional features can probably be generalized and used for both ACPI and AN. -- tejun
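The two-step fallback described above could be generalized roughly as follows. This is a minimal user-space sketch under stated assumptions, not libata code: struct opt_feature, feature_configure, the FEAT_* return values and the two stub callbacks are all hypothetical names invented for illustration. The can_ignore flag stands in for the "!FROZEN" condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical; none of this is libata API. */
struct opt_feature {
	bool failed_once;	/* a previous attempt already failed */
	bool disabled;		/* the feature has been given up on */
	bool enabled;		/* the feature was turned on successfully */
};

enum { FEAT_OK = 0, FEAT_RETRY = -1 };

/* Stand-ins for a device that chokes on the feature and one that does not. */
static int stub_always_fails(void *arg) { (void)arg; return 1; }
static int stub_always_works(void *arg) { (void)arg; return 0; }

/*
 * try_enable() is the device-specific attempt and returns 0 on success.
 * can_ignore says whether a repeated failure may be swallowed.
 */
static int feature_configure(struct opt_feature *f,
			     int (*try_enable)(void *), void *arg,
			     bool can_ignore)
{
	if (f->disabled)
		return FEAT_OK;		/* already turned off, carry on */

	if (try_enable(arg) == 0) {
		f->enabled = true;
		return FEAT_OK;
	}

	if (!f->failed_once) {
		f->failed_once = true;	/* step 1: record, ask for a retry */
		return FEAT_RETRY;
	}

	if (!can_ignore)
		return FEAT_RETRY;	/* must not be ignored: keep reporting */

	f->disabled = true;		/* step 2: give up on the feature */
	return FEAT_OK;
}
```

The design point is that the first failure is still reported, so the normal retry path runs once, while a second failure only costs the optional feature instead of the whole device.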
Re: [patch 1/7] libata: check for AN support
> + /* > + * check to see if this ATAPI device supports > + * Asynchronous Notification > + */ > + if ((ap->flags & ATA_FLAG_AN) && ata_id_has_AN(id)) > + { Bracketing police ^^^ > + /* issue SET feature command to turn this on */ > + rc = ata_dev_set_AN(dev); > + if (rc) { > + ata_dev_printk(dev, KERN_ERR, > + "unable to set AN\n"); > + rc = -EINVAL; > + goto err_out_nosup; How fatal is this - do we need to ignore the device at this point or should we just pretend (possibly correctly) that the device itself does not support notification? > @@ -299,6 +305,8 @@ struct ata_taskfile { > #define ata_id_queue_depth(id) (((id)[75] & 0x1f) + 1) > #define ata_id_removeable(id)((id)[0] & (1 << 7)) > #define ata_id_has_dword_io(id) ((id)[50] & (1 << 0)) > +#define ata_id_has_AN(id)\ > + ((id[76] && (~id[76])) & ((id)[78] & (1 << 5))) Might be nice to check ATA version as well to be paranoid but this all looks ok as it's a reserved field since way back when.
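One detail may be worth checking if the ata_id_has_AN macro reads exactly as quoted above: the left operand of the bitwise & is the 0-or-1 result of a logical &&, while the right operand is either 0 or bit 5, so the combined value would always be zero. The sketch below spells out what the test appears to be meant to do and keeps the quoted expression next to it for comparison. The function names are hypothetical, and treating word 76 values of 0x0000 and 0xffff as "not valid" is an assumption about the intent of the id[76] && ~id[76] part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the apparent intent, written as a function for clarity.
 * id points at the 256-word IDENTIFY data.
 */
static bool id_has_AN(const uint16_t *id)
{
	/* assumption: 0x0000 or 0xffff in word 76 means "word not valid" */
	if (id[76] == 0x0000 || id[76] == 0xffff)
		return false;
	/* bit 5 of word 78, the bit the quoted macro tests */
	return (id[78] & (1 << 5)) != 0;
}

/* The expression as quoted in the patch, unchanged in structure. */
static int id_has_AN_as_quoted(const uint16_t *id)
{
	return (id[76] && (~id[76])) & (id[78] & (1 << 5));
}
```

Using logical && between the two halves, or comparing the masked bit against zero as done here, avoids mixing a truth value with a bit mask.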
Re: [patch 2/7] genhd: expose AN to user space
Kristen Carlson Accardi wrote: > +static struct disk_attribute disk_attr_capability = { > + .attr = {.name = "capability_flags", .mode = S_IRUGO }, > + .show = disk_capability_read > +}; How about just "capability"? I think that would be more consistent with other attributes. -- tejun
Re: [patch 7/7] libata: send event when AN received
> + /* check the 'N' bit in word 0 of the FIS */ > + if (f[0] & (1 << 15)) { > + int port_addr = ((f[0] & 0x0f00) >> 8); > + struct ata_device *adev = &ap->device[port_addr]; You can't be sure that the port_addr returned will be in range if a device is malfunctioning...
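The range concern above can be illustrated with a small stand-alone sketch. The types, the MAX_DEVICES limit and the function name an_notifying_device are hypothetical and only mimic the shape of the quoted code; the bit positions are the ones from the patch. The 4-bit field can hold 0 to 15, so it has to be compared against the real array size before it is used as an index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVICES 2	/* hypothetical size of the per-port device array */

struct device { int id; };
struct port { struct device device[MAX_DEVICES]; };

/*
 * Decode the notifying device from FIS word 0. Returns NULL if the 'N'
 * bit is clear or if the address field points outside the device array.
 */
static struct device *an_notifying_device(struct port *ap, uint32_t fis0)
{
	unsigned int port_addr;

	if (!(fis0 & (1u << 15)))		/* 'N' bit not set */
		return NULL;

	port_addr = (fis0 & 0x0f00) >> 8;	/* 4-bit field: 0..15 */
	if (port_addr >= MAX_DEVICES)		/* malfunctioning device */
		return NULL;

	return &ap->device[port_addr];
}
```

Returning NULL lets the caller drop or log the bogus notification instead of dereferencing memory past the end of the array.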