RE: [PATCH] Re: reliability of linux-vm subsystem
> > Good, so the OOM killer works.
>
> But it doesn't work for this kind of application misbehaviours (or
> user attacks):
>
> main() { while(1) if (fork()) malloc(1); }

This seems to be a fork() bomb, not a VM issue. The system is overwhelmed by the forks, not by the space consumed by the allocations themselves.

For one thing, I've found that

main() { while(1) malloc(1024*1024); }

does not kill your system very quickly (if at all). Without actually writing to the memory, it doesn't seem to be "really" allocated. Adding a memset() will kill your system much more quickly.

chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] bugfix in oom_kill.c
This patch fixes a bug in oom_kill. The way it was written, the OOM killer would try to kill the idle task if the task selected immediately before it had the most "badness". Probably because of the order of for_each_task(), this wouldn't ever happen, but I don't think we want to depend on that.

chris

--- official/linux-2.4.0/mm/oom_kill.c Mon Nov 6 23:53:01 2000
+++ work/linux-2.4.0-test10/mm/oom_kill.c Thu Nov 9 23:12:10 2000
@@ -124,11 +143,12 @@
 	read_lock(&tasklist_lock);
 	for_each_task(p) {
-		if (p->pid)
+		if (p->pid) {
 			points = badness(p);
-		if (points > maxpoints) {
-			chosen = p;
-			maxpoints = points;
+			if (points > maxpoints) {
+				chosen = p;
+				maxpoints = points;
+			}
 		}
 	}
 	read_unlock(&tasklist_lock);
News gateway not working
I didn't want to post to the list with this, but my message to [EMAIL PROTECTED] didn't get a reply.

The NNTP gateway hasn't been working for two weeks; the last list message was 9/2/2000. Maybe this is common knowledge on the list (since I'm not subscribed, I obviously wouldn't know...) but it's a little discouraging to get absolute silence on fa.linux.kernel. If the problem is known, then someone should post directly to the newsgroup and let us know when it might be fixed.

I know there were some massive changes when the list switched to new servers, and hopefully that's the reason and it can be fixed soon. For many people, the newsgroup is a much easier way to read the list, and furthermore it reduces the load on vger, because otherwise we'd have to subscribe.

So please...look kindly on us.

chris
RE: Linux's implementation of poll() not scalable?
> It doesn't practically matter how efficient the X server is when
> you aren't busy, after all.

A simple polling scheme (i.e. not using poll() or select(), just looping through all fd's trying nonblocking reads) is perfectly efficient when the server is 100% busy, and perfectly inefficient when there is nothing to do.

I'm not saying that your statements are wrong--in your example, X is calling select() which is not wasting as much time as a hard-polling loop--but it's wrong to say that high-load efficiency is the primary concern. I would be horrified if X took a significant portion of the CPU time when many clients were connected, but none were actually doing anything.

chris
include fb.h from userland?
I understand that the headers in /usr/include/linux shouldn't be overwritten by new kernel installs. But can someone elaborate on Linus's original admonition (http://kernelnotes.org/lnxlists/linux-kernel/lk_0007_04/msg00881.html)? Am I never, ever, ever allowed to update my system headers for the rest of my life, or is it only if I follow some particular procedure, such as recompiling glibc?

The reason I want to upgrade my system headers is that framebuffer development requires linux/fb.h to be included from userland (I see no way around that). The version of fb.h in my system headers is from 2.2.5, the distro version I originally installed. I'm running a 2.2.17 kernel now, which has a much newer fb.h that I need.

chris
[PATCH] protect processes from OOM killer
Here's a small patch to allow a user to protect certain PIDs from death-by-OOM-killer. It uses the proc entry '/proc/sys/vm/oom_protect'; echo the PIDs to be protected:

echo 1 516 > /proc/sys/vm/oom_protect

The idea is that sysadmins can mark some daemon processes as off-limits for the OOM killer. Stuff like syslogd, init, etc. Incidentally, this answers Andrea's concern about the init process getting killed. In fact, it might be a good idea to default the list of protected PIDs to be { 1 }.

Things I'd like to add:

- ability to append PIDs. Using the 'echo >>' syntax would be nice, but /proc files don't seem to support appending. (is this true?)
- symbolic process names as well as PIDs, maybe process groups too?
- perhaps a more complex interface, where instead of just marking a PID as absolutely protected, you could specify a 'weight' which factored into the OOM algorithm. Something like "nice":

  -20 : unkillable
  -19 to -1: try not to kill
  1 to 19: try to kill these first

  echo netscape:10 > /proc/sys/vm/oom_protect

  ...would suggest that "netscape" is a process which is a good candidate for OOM killing.

I don't think that we should make the OOM heuristic any more complex. However, letting the user make suggestions about what should and should not be killed is a Good Thing.

This is my very first patch, so please be considerate. Against 2.4.0-test10. Comments and suggestions appreciated!
chris

--- official/linux-2.4.0-test10/mm/oom_kill.c Mon Nov 6 23:40:52 2000
+++ work/linux-2.4.0-test10/mm/oom_kill.c Mon Nov 6 23:37:47 2000
@@ -20,9 +20,32 @@
 #include
 #include
 #include
+#include
+
+#define MAX_OOM_PROTECTS 256
+
+int sysctl_oom_protects[MAX_OOM_PROTECTS];

 /* #define DEBUG */

+int is_oom_protected(int pid)
+{
+	int i;
+	for (i = 0; i < MAX_OOM_PROTECTS; i++) {
+		int ppid = sysctl_oom_protects[i];
+
+		#ifdef DEBUG
+		printk("Protected pid: %d\n",ppid);
+		#endif
+
+		if (ppid == pid)
+			return 1;
+		if (ppid == 0)
+			return 0;
+	}
+	return 0;
+}
+
 /**
  * int_sqrt - oom_kill.c internal function, rough approximation to sqrt
  * @x: integer of which to calculate the sqrt
@@ -124,6 +147,19 @@
 	read_lock(&tasklist_lock);
 	for_each_task(p) {
+		#ifdef DEBUG
+		printk("Testing pid %d\n",p->pid);
+		#endif
+
+		if (is_oom_protected(p->pid)) {
+
+			#ifdef DEBUG
+			printk("Pid %d is protected\n",p->pid);
+			#endif
+
+			continue;
+		}
+
 		if (p->pid)
 			points = badness(p);
 		if (points > maxpoints) {
--- official/linux-2.4.0-test10/kernel/sysctl.c Mon Nov 6 23:40:52 2000
+++ work/linux-2.4.0-test10/kernel/sysctl.c Mon Nov 6 23:30:08 2000
@@ -85,6 +85,8 @@
 extern int pgt_cache_water[];

+extern int sysctl_oom_protects [];
+
 static int parse_table(int *, int, void *, size_t *, void *, size_t,
 		ctl_table *, void **);
 static int proc_doutsstring(ctl_table *table, int write, struct file *filp,
@@ -241,6 +243,10 @@
 	 &bdflush_min, &bdflush_max},
 	{VM_OVERCOMMIT_MEMORY, "overcommit_memory", &sysctl_overcommit_memory,
 	 sizeof(sysctl_overcommit_memory), 0644, NULL, &proc_dointvec},
+
+	{VM_OVERCOMMIT_MEMORY, "oom_protect", &sysctl_oom_protects,
+	256, 0644, NULL, &proc_dointvec},
+
 	{VM_BUFFERMEM, "buffermem", &buffer_mem,
 	 sizeof(buffer_mem_t), 0644, NULL, &proc_dointvec},
 	{VM_PAGECACHE, "pagecache",
getting a process name from task struct
Is it possible to get a process's name / full execution path (from kernelspace) given only a task struct? I can't find any pointers to this information in the task struct, and I don't know where else it might be. ps seems to be able to get the process name, but that's from userspace.

Apologies in advance if this is a stupid question.

chris
[PATCH] oom_nice
Here's an updated version of the "oom_nice" patch. It allows a sysadmin to set the "oom niceness" for processes, either by PID or by process name. The oom niceness value factors into the badness() function called by Rik's OOM killer. Negative values decrease the chance that the process will be killed, and positive values increase it.

The usage is:

echo [PID|process_name]=oom_niceness > /proc/sys/vm/oom_nice

examples:

echo 418=-10 > /proc/sys/vm/oom_nice
echo netscape=20 > /proc/sys/vm/oom_nice
echo 1=- > /proc/sys/vm/oom_nice

In the first example, the process with PID 418 is 10 times less likely to be killed than it would have been. Likewise, in the second example, any processes named 'netscape' are 20 times more likely to be killed than otherwise. The last example protects the init process from being killed, no matter what.

Running cat on oom_nice will show the current nice values for all processes. By default the oom_nice proc entry is not world-readable or writable.

For security reasons I would suggest that you give good (negative) oom nice values to processes by PID rather than process name. If any process named 'init' is protected, then it's easy for a user to just rename their executable and get around the oom killer.

To test the OOM killer algorithm I also included a proc entry /proc/sys/vm/oom_nice_test. On my machine 'cat /proc/sys/vm/oom_nice_test' produces:

"OOM killer would have killed process 516 (csh) with 496 points"

Compiling oom_kill.c with DEBUG defined and running cat on oom_nice_test will print out the points for all processes, including their oom_nice values and how they affected the final points.
diff -u -N official/linux-2.4.0/mm/Makefile work/linux-2.4.0-test10/mm/Makefile
--- official/linux-2.4.0/mm/Makefile Mon Nov 6 23:53:01 2000
+++ work/linux-2.4.0-test10/mm/Makefile Tue Nov 7 22:01:00 2000
@@ -10,7 +10,8 @@
 O_TARGET := mm.o
 O_OBJS := memory.o mmap.o filemap.o mprotect.o mlock.o mremap.o \
 	vmalloc.o slab.o bootmem.o swap.o vmscan.o page_io.o \
-	page_alloc.o swap_state.o swapfile.o numa.o oom_kill.o
+	page_alloc.o swap_state.o swapfile.o numa.o oom_kill.o \
+	oom_nice.o

 ifeq ($(CONFIG_HIGHMEM),y)
 O_OBJS += highmem.o
--- official/linux-2.4.0/mm/oom_kill.c Mon Nov 6 23:53:01 2000
+++ work/linux-2.4.0-test10/mm/oom_kill.c Thu Nov 9 23:12:10 2000
@@ -20,9 +20,12 @@
 #include
 #include
 #include
+#include

 /* #define DEBUG */

+extern int get_oom_nice(struct task_struct *ts);
+
 /**
  * int_sqrt - oom_kill.c internal function, rough approximation to sqrt
  * @x: integer of which to calculate the sqrt
@@ -55,9 +58,9 @@
  *of least surprise ... (be careful when you change it)
  */

-static int badness(struct task_struct *p)
+int badness(struct task_struct *p)
 {
-	int points, cpu_time, run_time;
+	int points, cpu_time, run_time, oom_nice;

 	if (!p->mm)
 		return 0;
@@ -101,6 +104,22 @@
 	 */
 	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
 		points /= 4;
+
+	oom_nice = get_oom_nice(p);
+#ifdef DEBUG
+	if (oom_nice != 0)
+		printk(KERN_DEBUG "OOMkill: task %d (%s) has oom_nice=%d. start points: %d\n",
+			p->pid,p->comm,oom_nice,points);
+#endif
+
+	if (oom_nice == INT_MIN)
+		points = 0;
+	else if (oom_nice > 0)
+		points *= oom_nice;
+	else if (oom_nice < 0)
+		points /= -oom_nice;
+
+
 #ifdef DEBUG
 	printk(KERN_DEBUG "OOMkill: task %d (%s) got %d points\n",
 	p->pid, p->comm, points);
@@ -124,11 +143,12 @@
 	read_lock(&tasklist_lock);
 	for_each_task(p) {
-		if (p->pid)
+		if (p->pid) {
 			points = badness(p);
-		if (points > maxpoints) {
-			chosen = p;
-			maxpoints = points;
+			if (points > maxpoints) {
+				chosen = p;
+				maxpoints = points;
+			}
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -156,7 +176,7 @@
 	if (p == NULL)
 		panic("Out of memory and no killable processes...\n");

-	printk(KERN_ERR "Out of Memory: Killed process %d (%s).", p->pid, p->comm);
+	printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm);

 	/*
 	 * We give our sacrificial lamb high priority and access to
diff -u -N official/linux-2.4.0/mm/oom_nice.c work/linux-2.4.0-test10/mm/oom_nice.c
--- official/linux-2.4.0/mm/oom_nice.c Wed Dec 31 19:00:00 1969
+++ work/linux-2.4.0-test10/mm/oom_nice.c Thu Nov 9 23:19:45 2000
@@ -0,0 +1,250 @@
+/*
+ */
+
+#include
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifndef CONFIG_PROC_FS
+#error You really need /proc support for oom_
RE: asm-i386/uaccess.h changes: bug or feature?
To clarify: you're getting missing-symbol errors (not duplicate-symbols)?

I believe that the "return" versions of these macros have been deprecated. There's an effort going on to replace these macros with a standard "put_user(); return;" pair. People think that having a macro which returns from a function is a bad idea.

I imagine the source you're using hasn't been updated; I would suggest replacing the uses of the xxx_ret macros in the package you're compiling (or contacting its maintainer).

chris

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Wes McRae
> Sent: Wednesday, October 04, 2000 3:00 PM
> To: [EMAIL PROTECTED]
> Subject: asm-i386/uaccess.h changes: bug or feature?
>
>
> Hello
>
> Background: compiling lm_sensors 2.5.2 on RedHat 7.0 running 2.4.0-test8
> kernel. (vanilla intel system)
>
> Attempting to compile the sensor package gave symbol errors regarding
> the xxx_ret symbols (copying to/from user space, putting/getting). These
> have been removed from uaccess.h in recent kernels. These were present
> in 2.2.16 and at least some 2.3.x kernels. As you can see from the
> appended diff, these appear to be the only changes. I tend to assume
> there's a reason, but could find no explanation in the kernel docs for
> it.
>
> For what it's worth, I could still compile the application--just not
> load its modules.
>
> Many apologies if this is not the right place to send this--it seemed
> the most likely place after checking MAINTAINERS and REPORTING-BUGS.
>
> bye
> wes
>
> --- /usr/include/asm/uaccess.h Fri Aug 25 08:31:57 2000
> +++ /usr/src/linux/include/asm/uaccess.h Fri Sep 29 12:09:04 2000
>
> @@ -232,20 +232,6 @@
> : "=r"(err), ltype (x) \
> : "m"(__m(addr)), "i"(-EFAULT), "0"(err))
>
> -/*
> - * The "xxx_ret" versions return constant specified in third argument, if
> - * something bad happens. These macros can be optimized for the
> - * case of just returning from the function xxx_ret is used.
> - */
> -
> -#define put_user_ret(x,ptr,ret) ({ if (put_user(x,ptr)) return ret; })
> -
> -#define get_user_ret(x,ptr,ret) ({ if (get_user(x,ptr)) return ret; })
> -
> -#define __put_user_ret(x,ptr,ret) ({ if (__put_user(x,ptr)) return ret; })
> -
> -#define __get_user_ret(x,ptr,ret) ({ if (__get_user(x,ptr)) return ret; })
> -
>
> /*
> * Copy To/From Userspace
> @@ -582,10 +568,6 @@
> (__builtin_constant_p(n) ? \
> __constant_copy_from_user((to),(from),(n)) : \
> __generic_copy_from_user((to),(from),(n)))
> -
> -#define copy_to_user_ret(to,from,n,retval) ({ if (copy_to_user(to,from,n)) return retval; })
> -
> -#define copy_from_user_ret(to,from,n,retval) ({ if (copy_from_user(to,from,n)) return retval; })
>
> #define __copy_to_user(to,from,n) \
> (__builtin_constant_p(n) ? \
RE: __bad_udelay in 2.2.18pre15
> > 2.2.18pre15 defines udelay as (in file include/asm-i386/delay.h) :
> > extern void __bad_udelay(void);
> >
> > #define udelay(n) (__builtin_constant_p(n) ? \
> > ((n) > 2 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
> > __udelay(n))
> >
> > ...
> > It seems __bad_udelay is not defined anywhere in the kernel source.
>
> Correct. Its a compile time error trap

Wouldn't it be better to use an #error directive? I'm sure this could turn into a FAQ, even though the symbol is called "__bad_udelay()".

chris
RE: Updated 2.4 TODO List
> On Wed, 11 Oct 2000 18:10:40 -0400,
> [EMAIL PROTECTED] wrote:
> >Are you sure it was compiled with the correct CPU? If you configure the
> >CPU incorrectly (686 when you only have a 586, etc.) the kernel *will*
> >refuse to boot.
> >
> >Maybe we should have the kernel print the CPU information it was
> >compiled with before it does anything else. It'll make it easier to
> >catch what may be a fairly common set of PEBCAK case
>
> Unfortunately any code like this
> if (a)
> b = 99;
> generates conditional move (cmove) instructions on 686. In vsprintf.c
> there are several of these constructs, in particular strnlen generates
> it. So printk("%s", text) tends to fault as well. Some people have
> argued that critical routines should always be compiled with -i386,
> unfortunately that includes all of printk and all console handling
> (both serial and screen), not really an option.
>
> If anything is going to detect the mismatch and complain, it has to be
> the boot loader, after uncompressing and before entering the kernel
> proper.

But the kernel should be able to write directly to the screen, even if it's extremely minimal information. Something like how LILO does it: test the common hang-on-boot conditions (like wrong CPU type) and print a single character after each test.

chris
RE: large memory support for x86
> > Am I reading this correctly--the address of the main() function for a
> > process is guaranteed to be the lowest possible virtual address?
> >
> > chris
> >
>
> It is one of the lowest. The 'C' runtime library puts section
> .text (the code) first, then .data, then .bss, then .stack. The
> .stack section is co-located with the heap which can be extended
> by setting a new break address.
>
> When a process is created, the lowest address is the entry point of
> crt0.o _init. We can see where that is by:
>
> Script started on Thu Oct 12 14:25:35 2000
> # cat xxx.c
>
> extern int _init();
> main()
> {
> printf("_init is at %p\n", _init);
> }
>
> # gcc -o xxx xxx.c
> # ./xxx
> _init is at 0x804838c
> # exit
> exit
> Script done on Thu Oct 12 14:25:51 2000
>
> That said, remember that in Unix, the 'C' runtime library exists in the
> lower portion of the .text section. So your code's virtual address space
> starts above that address space. This is MMAPed so everybody gets
> to share the same pages. In this way, you don't all have to keep a
> private copy of the 'C' runtime library.

User-process virtual addresses have no direct relation to physical addresses, right? So why does the process space start at such a high virtual address (why not closer to 0x)? Seems we're wasting ~128 megs of RAM. Not a huge amount compared to 4G, but significant. Is that space used (libc can't be that big!) or reserved somehow?

Another question: how (and where in the code) do we translate virtual user-addresses to physical addresses? Does the MMU do it, or does it call a kernel handler function? Why is the kernel allowed to reference physical addresses, while user processes go through the translation step? Can kernel pages be swapped out / faulted in just like user process pages?

Sorry to pounce on you with all of these questions. I've read up on this stuff but can't always find answers...

thanks--
chris
why is modprobe (and nothing else) exec()'d?
Why is modprobe kept as a separate executable, when nothing else in the kernel is (seems to be)? What is the advantage to keeping modprobe separate, instead of statically linked into the kernel? Are users able to replace modprobe with a better version? If so, why not do the same thing with other occasionally-used code which could be replaced? Something like Rik's OOM killer comes to mind, except that obviously if you're out of memory you're not going to be able to load a new executable.

chris
RE: why is modprobe (and nothing else) exec()'d?
Ok, I should have thought of that ;-). I've never used modprobe directly myself, and had forgotten that was possible. Thanks to everyone who replied.

chris
RE: large memory support for x86
> no, x86 virtual memory is 32 bits - segmentation only provides a way to
> segment this 4GB virtual memory, but cannot extend it. Under Linux there
> is 3GB virtual memory available to user-space processes.
>
> this 3GB virtual memory does not have to be mapped to the same physical
> pages all the time - and this is nothing new. mmap()/munmap()-ing memory
> dynamically is one way to 'extend' the amount of physical RAM controlled
> by a single process. I doubt this would be very economical though.
>
> Such big-RAM systems are almost always SMP systems, so eg. a 4-way system
> can have 4x 3GB processes == 12 GB RAM fully utilized. An 8-way system can
> utilize up to 24 GB RAM at once, without having to play mmap/munmap
> 'memory extender' tricks.

Why is it that a user process can't intentionally switch segments? Dereferencing a 32-bit address causes the address to be calculated using the "current" segment descriptor, right? It seems to me that a process could set a new segment selector, in which case a dereference would operate on a whole new segment. Is there a reason why processes are limited to a single segment?

chris
quick questions: kernel stack size and call gates
1. Does Linux use call gates (as specified in the Intel SDK vol.3) when a user process makes a system call? From what I understand, call-gates let a ring-3 process execute ring-0 code, which sounds exactly like a system call. I've found all of the actual system call functions (sys_ni etc.) in sys.c, but where is the code which userland calls to transfer to "kernel mode" and execute a particular syscall?

2. I've often heard that the kernel stack size is set small (4 or 8k?). Is this done by limiting the size of the stack segment itself for the kernel? Where is the code which sets up the limit?

tia
chris