Re: BTRFS partition usage...
On Tue, 12 Feb 2008, Jeff Garzik wrote:
> Yep. I chose 32K unused space in the prototype filesystem I wrote [1, 2.4
> era]. I'm pretty sure I got that number from some other filesystem, maybe
> even some NTFS incarnation.

The NTFS superblock (and its partial mirror copy) can be anywhere except in
the first blocks. That space is where the $Boot file is placed, which
contains the bootstrap code and the BIOS Parameter Block; the latter
includes the NTFS signature and describes various filesystem information
needed to locate the superblock, etc.

Unlike mkfs.xfs, which has warned since at least 2002 and requires the -f
option to override Sun disklabels, at the moment mkfs.ntfs will indeed
destroy them.

Thank you for the bug report, and let's hope the next generation of Sun
hardware won't also scatter the firmware into random places inside a
partition, encoded by a fictitious disk cylinder size.

Szaka
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] util-linux-ng: unprivileged mounts support
On Sat, 19 Jan 2008, Miklos Szeredi wrote:
> But 'fusermount -u /tmp/test' does work, doesn't it?

You're submitting patches to get rid of fusermount, aren't you? Most users
have absolutely no idea what fusermount is, and they would __really__ like
to see umount(8) finally working.

Szaka
--
NTFS-3G: http://ntfs-3g.org
Re: [patch] util-linux-ng: unprivileged mounts support
On Wed, 16 Jan 2008, Miklos Szeredi wrote:
> This is an experimental patch for supporting unprivileged mounts and
> umounts.

User unmount unfortunately still doesn't work if the kernel doesn't have
the unprivileged mount support, but as we discussed last July, that
shouldn't be needed for this case.

% mount -t ntfs-3g /dev/hda10 /tmp/test
% cat /proc/mounts | grep /tmp/test
/dev/hda10 /tmp/test fuseblk rw,nosuid,nodev,user_id=501,group_id=501,allow_other 0 0
% mount | grep /tmp/test
/dev/hda10 on /tmp/test type fuseblk (rw,nosuid,nodev,allow_other,blksize=1024,user=szaka)
% umount /tmp/test
umount: /dev/hda10: not mounted
umount: /tmp/test: must be superuser to umount
umount: /dev/hda10: not mounted
umount: /tmp/test: must be superuser to umount

Szaka
--
NTFS-3G: http://ntfs-3g.org
Re: [ANNOUNCE] util-linux-ng 2.13.1 (stable)
On Wed, 16 Jan 2008, Karel Zak wrote:
> mount:
> - doesn't drop privileges properly when calling helpers [Ludwig Nussel]

How can a mount helper know, without being setuid root and redundantly
doing mount(8)'s work, that the user is allowed to mount via the 'user[s]'
fstab mount option?

Szaka
--
NTFS-3G: http://ntfs-3g.org
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Tue, 15 Jan 2008, Daniel Phillips wrote:
> Along with this effort, could you let me know if the world actually
> cares about online fsck? Now we know how to do it I think, but is it
> worth the effort.

Most users seem to care deeply about "things just work". Here is why
ntfs-3g also took the online fsck path some time ago.

NTFS support had a very bad reputation on Linux, so the new code was
written with rigid sanity checks and extensive automatic regression
testing. One of the consequences is that we're detecting way too many
inconsistencies left behind by the Windows and other NTFS drivers, by
hardware faults, and by device drivers.

To better utilize the non-existent developer resources, the obvious move
was to suggest the already existing Windows fsck (chkdsk) in such cases.
Simple and safe, as most people like us, who never used Windows, would
think. However, years of experience show that, depending on several
factors, chkdsk may or may not start, may or may not report the real
problems, may report bogus issues, may run long or just forever, and may
even remove completely valid files. So one could perhaps consider a
suggestion to run chkdsk an invitation to play Russian roulette.

Thankfully NTFS has some level of metadata redundancy, with signatures and
weak "checksums", which makes it possible to correct some common and
obvious corruptions on the fly. Similarly to ZFS, Windows Server 2008 also
has self-healing NTFS:

http://technet2.microsoft.com/windowsserver2008/en/library/6f883d0d-3668-4e15-b7ad-4df0f6e6805d1033.mspx?mfr=true

Szaka
--
NTFS-3G: http://ntfs-3g.org
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
Hi,

On Wed, 9 Jan 2008, Nigel Cunningham wrote:
> On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote:
> > For the suspend issue, there are also no easy solutions.
>
> What are the non-easy solutions?

From a practical point of view, I've seen only fuse rootfs mounts be a
problem. I remember Ubuntu patches for this (WUBI and some other distros
install an NTFS root). But this probably also depends on the suspend
implementation used. Personally I've never had a fuse-related suspend
problem with ordinary mounts during heavy use under development, nor has
any NTFS user problem been tracked down to it in the last year and a half.

Regards, Szaka
--
NTFS-3G: http://ntfs-3g.org
Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts
On Tue, 8 Jan 2008, Miklos Szeredi wrote:
> > On Tue, 2008-01-08 at 12:35 +0100, Miklos Szeredi wrote:
> > > +static int reserve_user_mount(void)
> > > +{
> > > +	int err = 0;
> > > +
> > > +	spin_lock(&vfsmount_lock);
> > > +	if (nr_user_mounts >= max_user_mounts && !capable(CAP_SYS_ADMIN))
> > > +		err = -EPERM;
> > > +	else
> > > +		nr_user_mounts++;
> > > +	spin_unlock(&vfsmount_lock);
> > > +	return err;
> > > +}
> >
> > Would -ENOSPC or -ENOMEM be a more descriptive error here?
>
> The logic behind EPERM is that this failure is only for unprivileged
> callers. ENOMEM is too specifically about OOM. It could be changed
> to ENOSPC, ENFILE, EMFILE, or it could remain EPERM. What do others
> think?

I think it would be important to log the non-trivial errors. mount(8)
already hints in several cases to check dmesg for the reason, since it's
too challenging to figure out from the errno value alone what exactly the
problem is. This could also avoid misleading troubleshooters with the
mount/sysctl race.

Szaka
--
NTFS-3G: http://ntfs-3g.org
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Sat, 27 Oct 2007, Anton Altaparmakov wrote:
> And another of my pet peeves with ->bmap is that it uses 0 to mean "sparse"
> which causes a conflict on NTFS at least as block zero is part of the $Boot
> system file so it is a real, valid block... NTFS uses -1 to denote sparse
> blocks internally.

In practice, the meaning of 0 is file system [driver] dependent. For
example, in the case of NTFS-3G it means that the block is sparse, or the
file is encrypted, or compressed, or resident, or it's the $Boot file, or
an error happened.

Thankfully the widely used FIBMAP users (swapon and the ever less used
lilo) are only interested in the non-zero values, and they report an error
if the driver returns 0 for some reason. Which is perfectly ok, since both
swapping and Linux booting would fail using a sparse, encrypted,
compressed, or resident file, or the NTFS $Boot file. But in reality, both
swap files and lilo work fine with NTFS if the needed files were created
the way these programs expect. If not, then swapon or lilo will catch and
report the file creation error.

AFAIR, somebody is doing (has done?) an indeed much needed, better
alternative. Bmap is legacy; thank you Mike for maintaining it.

Szaka
--
NTFS-3G Lead Developer: http://ntfs-3g.org
Re: curedump configuration additions
On Sat, 5 May 2001, Michael Miller wrote:
> +coredump_enabled:
> +When enabled (which is the default), Linux will produce
[...]
> +coredump_log:
> +The default is to log coredumps.

The default looks like an effective way to DoS logging and fill the system
partition fast.

Another nice optional feature, from a development, debug, and QA point of
view, would be the ability to dump set[ug]id apps, or apps that changed
their uid or gid, a la

kern.sugid_coredump (FreeBSD)
kern.nosuidcoredump (OpenBSD)
allow_setid_core (Solaris)

etc.

Szaka
Re: [PATCH] Thread core dumps for 2.4.4
On Thu, 3 May 2001, Don Dugger wrote:
> The attached patch allows core dumps from thread processes in the 2.4.4
> kernel. This patch is the same as the last one I sent out except it fixes
> the same bug that `kernel/fork.c' had with duplicate info in the `mm'
> structure, plus this patch has had more extensive testing.

AFAIK Linux can't dump the threads to the same file as others, but doing
it to different files looks a bit scary. How does the system behave when
you dump a heavily threaded app with a decent VM [i.e. just think about
bloatware instead of malicious code]? How will the developer know which
thread caused the fault?

I've found that dumping just the faulting thread is enough in about 100%
of the cases, especially because [on SMP] the others can run on, and their
dump is much closer to "garbage" than useful info from a debugging point
of view.

Szaka
Re: __alloc_pages: 4-order allocation failed
On Thu, 26 Apr 2001, Jeff V. Merkey wrote:
> I am seeing this as well on 2.4.3 with both _get_free_pages() and
> kmalloc(). In the kmalloc case, the modules hang waiting
> for memory.

One possible source of this hang is the change below in 2.4.3: non-GFP_ATOMIC
and non-recursive allocations (those without PF_MEMALLOC set) will loop
until the requested contiguous memory is available.

Szaka

diff -u --recursive --new-file v2.4.2/linux/mm/page_alloc.c linux/mm/page_alloc.c
--- v2.4.2/linux/mm/page_alloc.c	Sat Feb  3 19:51:32 2001
+++ linux/mm/page_alloc.c	Tue Mar 20 15:05:46 2001
@@ -455,8 +455,7 @@
 			memory_pressure++;
 			try_to_free_pages(gfp_mask);
 			wakeup_bdflush(0);
-			if (!order)
-				goto try_again;
+			goto try_again;
 		}
 	}
Re: OOM killer *WORKS* for a change!
On Fri, 13 Apr 2001, Mike A. Harris wrote:
> I just ran netscape which for some reason or another went totally
> whacky and gobbled RAM. It has done this before and made the box
> totally unuseable in 2.2.17-2.2.19 befor the kernel killed 90% of
> my running apps before getting the right one.

I ported the 2.4 OOM killer to 2.2 about half a year ago; it's available
for the 2.2.19 kernel at

http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_memory.html

Note, since it's activated in the page fault handler, which is
architecture dependent, the current patch works only on x86 (the only one
I could test). If anyone is interested in other archs, let me know.

Szaka
Re: scheduler went mad?
On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > You mean without dropping the out_of_memory() test in kswapd and calling
> > oom_kill() in the page fault [i.e. without the additional patch]?
> No. I think it's ok for __alloc_pages() to call oom_kill()
> IF we turn out to be out of memory, but that should not even
> be needed.

Not __alloc_pages() calls oom_kill(), but do_page_fault(). Not the same.
After the system has tried *really* hard to get *one* free page and
couldn't manage it, why loop forever? To eat CPU, waiting for
out_of_memory() to *guess* when the system is OOM? I don't think so: if
processes can't progress because the system can't page in any of their
pages, somebody must go.

> Also, when a task in __alloc_pages() is OOM-killed, it will
> have PF_MEMALLOC set and will immediately break out of the
> loop. The rest of the system will spin around in the loop
> until the victim has exited and then their allocations will
> succeed.

Yes, I think this is a problem. In the page fault, if OOM, the "bad"
process is selected, scheduled, and killed, and everybody runs on happily
without even noticing the system is low on memory. Fast and graceful
process killing instead of a slow, painful death, IF out_of_memory()
correctly detects OOM.

Szaka
Re: scheduler went mad?
On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > I still feel a bit uncomfortable about processes looping forever in
> > __alloc_pages and because of this oom_killer also can't be moved to
> > page fault handler where I think its place should be. I'm using the
> > patch below.
> It's BROKEN. This means that if you have one task using up
> all memory and you're waiting for the OOM kill of that task
> to have effect, your syslogd, etc... will have their allocations
> fail and will die.

You mean without dropping the out_of_memory() test in kswapd and calling
oom_kill() in the page fault [i.e. without the additional patch]? Yes,
you're completely right, but I have the patch [see the example below; 'm1'
is the bad guy]. I just didn't have time to test it extensively, and I
don't know whether there are side effects of getting rid of this infinite
looping in __alloc_pages(), but locked up processes apparently don't make
people very happy ;)

Szaka

Out of Memory: Killed process 830 (m1), saved process 696 (httpd)
procs memory swap io system
 r b w swpd free buff cache si so bi bo in cs
 6 0 0 0 9492100 1496 0 0 1386 2 2904 3877
 5 0 0 0 7812104 1788 0 0 289 0 68922
 5 0 0 0 6248104 1788 0 0 0 0 10819
 5 0 0 0 4748108 1840 0 056 0 21921
 5 0 0 0 3268108 1868 0 028 0 16523
 5 0 1 0 1864 76 1868 0 0 0 5 12061
 5 0 1 0 1432 76 1252 0 0 0 0 108 1130
 5 0 1 0 1236 80796 0 065 0 246 4588
 5 0 1 0 1236 80668 0 0 0 0 110 8869
 6 0 1 0948112696 0 0 805 0 1814 8231
Out of Memory: Killed process 858 (m1), saved process 811 (vmstat)
 5 0 1 0924152444 0 0 1153 0 2731 18231
 4 0 1 0 1720148828 0 0 750 3 1711 1876
 5 0 1 0 1156148760 0 0 290 0 723 1967
 4 0 1 0 1152132664 0 070 0 277 7249
 4 0 1 0 1140144560 0 054 0 238 7942
 4 0 1 0 1140144460 0 032 0 212 7521
Out of Memory: Killed process 834 (m1), saved process 418 (identd)
Re: scheduler went mad?
On Thu, 12 Apr 2001, Marcelo Tosatti wrote:
> This patch is broken, ignore it.
> Just removing wakeup_bdflush() is indeed correct.
> We already wakeup bdflush at try_to_free_buffers() anyway.

I still feel a bit uncomfortable about processes looping forever in
__alloc_pages, and because of this the oom_killer also can't be moved to
the page fault handler, where I think its place should be. I'm using the
patch below.

Szaka

--- mm/page_alloc.c.orig	Sat Mar 31 19:07:22 2001
+++ mm/page_alloc.c	Mon Apr  2 21:05:31 2001
@@ -453,8 +453,12 @@
 		 */
 		if (gfp_mask & __GFP_WAIT) {
 			memory_pressure++;
-			try_to_free_pages(gfp_mask);
-			wakeup_bdflush(0);
+			if (!try_to_free_pages(gfp_mask))
+				return NULL;
 			goto try_again;
 		}
 	}
Re: scheduler went mad?
On Thu, 12 Apr 2001, Marcelo Tosatti wrote: This patch is broken, ignore it. Just removing wakeup_bdflush() is indeed correct. We already wakeup bdflush at try_to_free_buffers() anyway. I still feel a bit unconfortable about processes looping forever in __alloc_pages and because of this oom_killer also can't be moved to page fault handler where I think its place should be. I'm using the patch below. Szaka --- mm/page_alloc.c.orig Sat Mar 31 19:07:22 2001 +++ mm/page_alloc.c Mon Apr 2 21:05:31 2001 @@ -453,8 +453,12 @@ */ if (gfp_mask __GFP_WAIT) { memory_pressure++; - try_to_free_pages(gfp_mask); - wakeup_bdflush(0); + if (!try_to_free_pages(gfp_mask)); + return NULL; goto try_again; } } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: scheduler went mad?
On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > I still feel a bit uncomfortable about processes looping forever in
> > __alloc_pages(), and because of this the oom_killer also can't be
> > moved to the page fault handler, where I think its place should be.
> > I'm using the patch below.
> It's BROKEN. This means that if you have one task using up all memory
> and you're waiting for the OOM kill of that task to have effect, your
> syslogd, etc... will have their allocations fail and will die.

You mean without dropping the out_of_memory() test in kswapd and calling
oom_kill() in the page fault handler [i.e. without the additional patch]?
Yes, you're completely right. But I have the patch [see the example below,
'm1' is the bad guy]; I just didn't have time to test it extensively, and
I don't know whether there are side effects of getting rid of this
infinite looping in __alloc_pages(). Locked-up processes apparently don't
make people very happy, though ;)

	Szaka

Out of Memory: Killed process 830 (m1), saved process 696 (httpd)
   procs         memory        swap        io      system
 r  b  w  swpd  free  buff  cache  si  so  bi  bo  in  cs
 6 0 0 0 9492100 1496 0 0 1386 2 2904 3877
 5 0 0 0 7812104 1788 0 0 289 0 68922
 5 0 0 0 6248104 1788 0 0 0 0 10819
 5 0 0 0 4748108 1840 0 056 0 21921
 5 0 0 0 3268108 1868 0 028 0 16523
 5 0 1 0 1864 76 1868 0 0 0 5 12061
 5 0 1 0 1432 76 1252 0 0 0 0 108 1130
 5 0 1 0 1236 80796 0 065 0 246 4588
 5 0 1 0 1236 80668 0 0 0 0 110 8869
 6 0 1 0948112696 0 0 805 0 1814 8231
Out of Memory: Killed process 858 (m1), saved process 811 (vmstat)
 5 0 1 0924152444 0 0 1153 0 2731 18231
 4 0 1 0 1720148828 0 0 750 3 1711 1876
 5 0 1 0 1156148760 0 0 290 0 723 1967
 4 0 1 0 1152132664 0 070 0 277 7249
 4 0 1 0 1140144560 0 054 0 238 7942
 4 0 1 0 1140144460 0 032 0 212 7521
Out of Memory: Killed process 834 (m1), saved process 418 (identd)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: scheduler went mad?
On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > You mean without dropping the out_of_memory() test in kswapd and
> > calling oom_kill() in the page fault handler [i.e. without the
> > additional patch]?
> No. I think it's ok for __alloc_pages() to call oom_kill() IF we turn
> out to be out of memory, but that should not even be needed.

It's not __alloc_pages() that calls oom_kill(), but do_page_fault(). Not
the same. After the system has tried *really* hard to get *one* free page
and couldn't manage it, why loop forever? To eat CPU while waiting for
out_of_memory() to *guess* when the system is OOM? I don't think so. If
processes can't make progress because the system can't page in any of
their pages, somebody must go.

> Also, when a task in __alloc_pages() is OOM-killed, it will have
> PF_MEMALLOC set and will immediately break out of the loop. The rest of
> the system will spin around in the loop until the victim has exited and
> then their allocations will succeed.

Yes, I think this is a problem. With OOM handled in the page fault
handler, the "bad" process is selected, scheduled and killed, and
everybody else runs on happily without even noticing the system was low
on memory. Fast and graceful process killing instead of a slow, painful
death, IF out_of_memory() correctly detects OOM.

	Szaka
Re: pcnet32 (maybe more) hosed in 2.4.3
On Fri, 30 Mar 2001, Scott G. Miller wrote:
> Linux 2.4.3, Debian Woody. 2.4.2 works without problems. However, in
> 2.4.3, pcnet32 loads, gives an error message:

2.4.3 (and the -ac's) is also broken as a guest in VMware due to the
pcnet32 changes [doing 32-bit IO on the 16-bit regs of the 79C970A
controller]. Reverting this part of patch-2.4.3, below, made things work
again.

	Szaka

@@ -528,11 +535,13 @@
     pcnet32_dwio_reset(ioaddr);
     pcnet32_wio_reset(ioaddr);

-    if (pcnet32_wio_read_csr (ioaddr, 0) == 4 && pcnet32_wio_check (ioaddr)) {
-	a = &pcnet32_wio;
+    /* Important to do the check for dwio mode first. */
+    if (pcnet32_dwio_read_csr(ioaddr, 0) == 4 && pcnet32_dwio_check(ioaddr)) {
+	a = &pcnet32_dwio;
     } else {
-	if (pcnet32_dwio_read_csr (ioaddr, 0) == 4 && pcnet32_dwio_check(ioaddr)) {
-	    a = &pcnet32_dwio;
+	if (pcnet32_wio_read_csr(ioaddr, 0) == 4 &&
+	    pcnet32_wio_check(ioaddr)) {
+	    a = &pcnet32_wio;
 	} else
 	    return -ENODEV;
     }
Re: OOM killer???
On Thu, 29 Mar 2001, Dr. Michael Weller wrote:
> Applications forking and then dirtying their shared data pages
> madly? OOps.. nothing.. Why? It cannot be done!

In eager mode Solaris, Tru64 and Irix can do it, and so could the
non-overcommit patch for Linux by Eduardo Horvath from last year (you get
ENOMEM at fork).

	Szaka
Re: OOM killer???
On Thu, 29 Mar 2001, Dr. Michael Weller wrote:
> On Thu, 29 Mar 2001, Szabolcs Szakacsits wrote:
> > The point is AIX *can* guarantee [even for an ordinary process] that
> > your signal handler will be executed, Linux can *not*. It doesn't matter
> No it can't... and the reason is...

So AIX is buggy in eager mode, not reserving a couple of extra pages [per
process] to be able to run the handler. What AIX version(s) do you use?
Anyway, as you probably noticed, at present I'm not a big supporter of
introducing SIGDANGER; too many things can be messed up for little or no
gain.

> Note that there are nasty users like me, which provide a no_op function
> as SIGDANGER handler.

For example this.

> Joe blow user can code a SIGDANGER exploiting prog that will kill the
> whole concept by allocating memory in SIGDANGER.

And this. Moreover it needn't even be malicious: people happily write
signal handlers that would blow things up without realising it ... and
the admin still has no control over these things ;) Sure, these could be
worked around, but I feel it just isn't worth the added complexity.

> About this early allocation myth: Did you actually read the page?
> The fact its controlled by a silly environment variable shows it
> is a mere user space issue.

This is my question as well ;) I didn't read the AIX source, but my guess
is that the kernel sets a bit in the task structure for eager mode during
the exec() syscall and takes care of everything; at least this is what
the document suggests ;) [see the bottom of the page]

	Szaka
Re: OOM killer???
On Thu, 29 Mar 2001, Dr. Michael Weller wrote:
> On Wed, 28 Mar 2001, Andreas Dilger wrote:
> > Szaka writes:
> > > And every time the SIGDANGER comes up, the issue that AIX provides
> > > *both* early and late allocation mechanisms, even on a per-process
> > > basis, that can be controlled by *both* the programmer and the
> > > admin, is completely ignored. Linux supports none of these...
> Maybe some details here were helpful.
> http://www.unet.univie.ac.at/aix/aixbman/baseadmn/pag_space_under.htm
> > > ...with the current model it's quite possible the handler code is
> > > still sitting on the disk and no memory will be available to page
> > > it in.
> > Actually, I see SIGDANGER being useful at the time we hit freepages.min

The point is that AIX *can* guarantee [even for an ordinary process] that
your signal handler will be executed; Linux can *not*. It doesn't matter
where the different OOM watermarks are; there will always be situations
where, by the time your handler gets control, it's already far too late
[because between sending SIGDANGER and the app getting control (you can't
schedule e.g. 1000 apps at the same time) the system ran into OOM and
killed just your app (while e.g. the other 999 buggy, memory-leaking apps
registered a no-op SIGDANGER handler); hope you get the picture even if
the example is highly unrealistic].

> > (or maybe even freepages.low), rather than actual OOM so that the apps
> > have a chance to DO something about the memory shortage,

Primarily *users* should have a chance to control this, not developers
and the kernel. The latter should provide a way to control things and a
reasonable default [Linux already has the latter but not the former].
Guess which one wants to be killed if you run Oracle in production and
DB2, Informix, Sybase, etc. in trial. Now only the kernel decides. With
SIGDANGER, still only developers/kernel would decide [and forget for now
about resource management that would prevent running all of them on the
same box; Linux users want to fully utilize the box ;)].

So again, IMHO to address this long standing problem, Linux needs:
- optional non-overcommit; per-process granularity would be nice [leaving
  the default just as it is now], and the OOM killer could additionally
  weight based on this info
- reserved/guaranteed superuser memory [otherwise, in non-overcommit
  mode, on a user-space OOM (= system OOM) the OOM killer would just take
  action]
- as a last chance not to deadlock, an advisory OOM killer with a
  reasonable default [the current default is pretty fine, apart from its
  current bugs]
- a HOWTO about preventing OOM, preventing kills of your important
  processes, etc.

Later on, virtual swap space [the degree of memory overcommitment] and
SIGDANGER might be useful, but I don't think they are at present. BTW,
the issue is far more difficult than some of the "let's fix malloc and
its friends" people think.

> If freepages.min is reached, AIX starts to kill processes (just like OOM
> killer). It uses some heuristics which might be better than ours, but I
> doubt it.

If every process runs in non-overcommit mode, AIX kills init first :)

	Szaka
Re: OOM killer???
On Tue, 27 Mar 2001, Rogier Wolff wrote:
> Out of Memory: Killed process 117 (sendmail).
[ ... many of these ... ]
> Out of Memory: Killed process 117 (sendmail).
>
> What we did to run it out of memory, I don't know. But I do know that
> it shouldn't be killing one process more than once... (the process
> should not exist after one try...)

I already noted this last week. Processes in TASK_UNINTERRUPTIBLE state
can't be scheduled, so they won't be killed immediately. This state can
even be permanent if the process is using [buggy?] smbfs, NFS without the
'hard,intr' options, buggy drivers or hardware. Worse, if this state is
permanent, a lockup is guaranteed [the other, random OOM killer in the
page fault handler never gets the chance to run, for some mysterious
reason (it worked fine in 2.2)].

The solution is easy: one bit in the task structure should indicate that
the process has already been SIGKILL'ed ... oops, it must already be
there, so it should just be taken into account by the OOM killer.
Hopefully it won't result in a massacre ... [that would still be better
than a lockup, wouldn't it ;)]

	Szaka
Re: OOM killer???
On Tue, 27 Mar 2001, Andreas Dilger wrote:
> Every time this subject comes up, I point to AIX and SIGDANGER - a signal
> sent to processes when the system gets OOM.

And every time SIGDANGER comes up, the issue that AIX provides *both*
early and late allocation mechanisms, even on a per-process basis, that
can be controlled by *both* the programmer and the admin, is completely
ignored. Linux supports none of these, and with the current model it's
quite possible the handler code is still sitting on disk and no memory
will be available to page it in. Or do you want to see more apps running
as setuid-root and mlock'ing their handlers, wasting useful memory and
opening an even bigger window for security exploits in the future? Even
using capabilities instead of setuid-root, only developers could
influence the behavior, not the admins who must operate the box. No, at
present the SIGDANGER bloat would just be a fake excuse and wouldn't
address the root of the problems at all.

	Szaka
Re: [PATCH] Prevent OOM from killing init
On Sat, 24 Mar 2001, Jesse Pollard wrote:
> On Fri, 23 Mar 2001, Alan Cox wrote: [ about non-overcommit ]
> > Nobody feels its very important because nobody has implemented it.

Enterprises use other systems because they have much better resource
management than Linux; adding non-overcommit wouldn't help them much.
Desktop users and Linux newbies don't understand what
eager/early/non-overcommit vs lazy/late/overcommit memory management is
[just see these threads, if you aren't bored enough already ;)], and even
if they do, in the end they don't have the ability to implement it. And
in between, people are mostly fine with ulimit.

> Small correction - It was implemented, just not included in the standard
> kernel.

Please note, adding optional non-overcommit also wouldn't help much
without guaranteed/reserved resources [e.g. you are OOM -> apps and users
complain, the admin logs in and BANG, the OOM killer just killed one of
the jobs]. This was one of the reasons I made the reserved root memory
patch [this is also the way other OSes do it]. Now the different patches
should just be merged, and an OOM FAQ written for users on how to avoid
it, control it, etc.

	Szaka
Re: [PATCH] Prevent OOM from killing init
On Fri, 23 Mar 2001, Alan Cox wrote:
> > > and rely on it. You might find you need a few Gbytes of swap just to
> > > boot
> > Seems a bit of an exaggeration ;) Here are numbers,
> NetBSD is if I remember rightly still using a.out library styles.

No, it uses ELF today; moreover the numbers were from Solaris. NetBSD
also switched from non-overcommit to overcommit-only [AFAIK] mode, with
"random" process killing, with its new UVM.

> > 6-50% more VM and the performance hit also isn't so bad as it's thought
> > (Eduardo Horvath sent a non-overcommit patch for Linux about one year
> > ago).
> The Linux performance hit would be so close to zero you shouldnt be able
> to measure it - or it was in 1.2 anyway

Yep, something like this :)

	Szaka
Re: [PATCH] Prevent OOM from killing init
On Fri, 23 Mar 2001, Paul Jakma wrote:
> On Fri, 23 Mar 2001, Szabolcs Szakacsits wrote:
> > About the "use resource limits!". Yes, this is one solution. The
> > *expensive* solution (admin time, worse resource utilization, etc).

Thanks for cutting out the relevant parts that said how to increase the
user base and satisfaction while keeping and using the existing
possibilities as well.

> traditional user limits have worse resource utilisation? think what
> kind of utilisation a guaranteed allocation system would have. instead
> of 128MB, you'd need maybe a GB of RAM and many many GB of swap for
> most systems.

Nonsense hodgepodge. See and/or measure the impact; I sent numbers in my
earlier email. You also missed that non-overcommit must be _optional_
[i.e. you wouldn't be forced to use it ;)]. Yes, there are users and
enterprises who require it and would happily pay the 50-100% extra swap
space for the same workload and the extra reliability.

> - setting up limits on a RH system takes 1 minute by editing
> /etc/security/limits.conf.

Plus every time you add/delete users, add/delete special apps, etc.
Please note again: some people want it this way, some only sometimes, and
others really don't care, because the system guarantees the admins that
they will always have the resources to take action [unfortunately this is
not Linux].

> - Rik's current oom killer may not do a good job now, but it's
> impossible for it to do a /perfect/ job without implementing
> kernel/esp.c.

Rik's killer is quite fine as a _default_. But there will always be
people who don't like it [the bastards think humans can still make better
decisions than machines]. Wouldn't it be a win for both sides if you
could point out, "Hey, if you don't like the default, use the
/proc/sys/vm/oom_killer interface"? As I said before, there is also such
a patch by Chris Swiedler, and it's definitely not a huge, complex one.
And these stupid threads could be forgotten for good.

> - with limits set you will have:
> - /possible/ underutilisation on some workloads.

Depends; guaranteed underutilisation or guaranteed extra unreliability
often fits the picture just as well.

> no matter how good or bad Rik's killer is, i'd much rather set limits
> and just about /never/ have it invoked.

Thanks for expressing your opinion, but others [not necessarily me]
"occasionally" hold a different one, depending on the job the box must
do.

	Szaka
Re: [PATCH] Prevent OOM from killing init
On Thu, 22 Mar 2001, Alan Cox wrote:
> I'd like to have it there as an option. As to the default - You
> would have to see how much applications assume they can overcommit
> and rely on it. You might find you need a few Gbytes of swap just to
> boot

Seems a bit of an exaggeration ;) Here are numbers,
http://lists.openresources.com/NetBSD/tech-userlevel/msg00722.html
6-50% more VM, and the performance hit also isn't as bad as it's thought
to be (Eduardo Horvath sent a non-overcommit patch for Linux about one
year ago).

	Szaka
Re: [PATCH] Prevent OOM from killing init
On Thu, 22 Mar 2001, Guest section DW wrote:
> Presently however, a flawless program can be killed.
> That is what makes Linux unreliable.

Your position is "save the application, crash the OS!". You can't be
blamed, because that's everybody's first reaction :) But if you start
thinking about it, you reach the conclusion that process killing can't be
avoided if you want the system to keep running. I agree, though, that
Linux lacks some important things [see my other email] that could make
the situation easily and inexpensively controllable.

BTW, your app isn't flawless, because it doesn't take into account that
Linux memory management is [quasi-]overcommit-only at present ;) [or you
used other apps as well; e.g. login, ps or cron is enough to kill your
app when it stopped at OOM time].

	Szaka
Re: [PATCH] Prevent OOM from killing init
On Thu, 22 Mar 2001, Alan Cox wrote:
> One of the things that we badly need to resurrect for 2.5 is the
> beancounter work which would let you reasonably do things like
> guaranteed Oracle a certain amount of the machine, or restrict all
> the untrusted users to a total of 200Mb hard limit between them etc

This would improve Linux reliability, but it could be much better with
the addition of *optional* non-overcommit (most other OSes support this
too, and mostly it's even their default [please, no "but it deadlocks",
because it's not true; they also kill processes (Solaris, etc.)]),
reserved superuser memory (a la Solaris, Tru64, etc.: when OOM hits in
non-overcommit mode, users complain and the superuser acts, instead of
the OS killing their tasks), and a superuser *advisory* OOM killer [there
was a patch for this before]; I think in the last area Linux is already
ahead of the others at present.

About the "use resource limits!". Yes, this is one solution. The
*expensive* solution (admin time, worse resource utilization, etc).
Others make it cheaper by mixing it with the above.

	Szaka
Re: [PATCH] Prevent OOM from killing init
On Wed, 21 Mar 2001, Rik van Riel wrote:
> One question ... has the OOM killer ever selected init on
> anybody's system ?

Hi Rik,

When I ported your OOM killer to 2.2.x and integrated it into the
'reserved root memory' [*] patch, during intensive testing I found two
cases where init was killed. It happened on low-end machines, when the
OOM killer wasn't triggered and so init was killed in the page fault
handler. The latter was also one of the reasons I replaced the "random"
OOM killer in the page fault handler with yours [so there is only one OOM
killer]. I asked you at the time whether there was any reason you didn't
put it there as well, but unfortunately you didn't answer. Practice
showed it works there too [and actually some crashes that were reported
here recently could have been avoided this way], but maybe I missed
something technical?

Other things that bothered me:
- niced processes are penalized
- trying to kill a task that is permanently in TASK_UNINTERRUPTIBLE will
  probably deadlock the machine [or the random OOM killer will kill the
  box].

	Szaka

[*] for those interested, it can be found at
http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_memory.html
Re: [PATCH] Prevent OOM from killing init
On Thu, 22 Mar 2001, Alan Cox wrote:
> One of the things that we badly need to resurrect for 2.5 is the
> beancounter work which would let you reasonably do things like
> guarantee Oracle a certain amount of the machine, or restrict all the
> untrusted users to a total of 200Mb hard limit between them etc

This would improve Linux reliability, but it could be much better still with the addition of *optional* non-overcommit (most other OSes support this too, and mostly it's even their default [and please, no "but it deadlocks" -- it isn't true, they also kill processes (Solaris, etc.)]), reserved superuser memory (a la Solaris, Tru64, etc.: when OOM hits in non-overcommit mode, users complain and the superuser acts, instead of the OS killing their tasks) and a superuser-controlled *advisory* OOM killer [there was a patch for this before]; in that last area I think Linux is already ahead of the others at present.

About the "use resource limits!" answer: yes, that is one solution. The *expensive* solution (admin time, worse resource utilization, etc.). Other systems make it cheaper by mixing it with the options above.

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
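As it happens, the overcommit policy did become a runtime knob: /proc/sys/vm/overcommit_memory. A small sketch that reads it defensively (on modern kernels 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting, i.e. the optional non-overcommit mode argued for here; the function returns -1 where the knob cannot be read):

```c
#include <stdio.h>

/* Read the VM overcommit policy knob, or -1 if it is unreadable.
 * 0 = heuristic, 1 = always overcommit, 2 = strict accounting. */
static int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f != NULL) {
        if (fscanf(f, "%d", &mode) != 1)
            mode = -1;      /* unreadable contents */
        fclose(f);
    }
    return mode;
}
```

An admin flips the mode with a plain write to the same file; a monitoring daemon could use a reader like this to warn when a box is running in always-overcommit mode.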
Re: [PATCH] Prevent OOM from killing init
On Thu, 22 Mar 2001, Alan Cox wrote:
> I'd like to have it there as an option. As to the default - You would
> have to see how much applications assume they can overcommit and rely
> on it. You might find you need a few Gbytes of swap just to boot

Seems a bit of an exaggeration ;) Here are some numbers: http://lists.openresources.com/NetBSD/tech-userlevel/msg00722.html -- 6-50% more VM, and the performance hit also isn't as bad as it's thought to be (Eduardo Horvath sent a non-overcommit patch for Linux about a year ago).

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Prevent OOM from killing init
On Thu, 22 Mar 2001, Guest section DW wrote:
> Presently however, a flawless program can be killed. That is what
> makes Linux unreliable.

What you advocate is "save the application, crash the OS!". You can't be blamed, because that is everybody's first reaction :) But if you think it through, you reach the conclusion that process killing can't be avoided if you want the system to keep running. I do agree that Linux lacks some important things [see my other email] that could make the situation easily and inexpensively controllable.

BTW, your app isn't flawless, because it doesn't take into account that Linux memory management is [quasi-]overcommit-only at present ;) [or you used other apps as well -- e.g. login, ps or cron is enough to kill your app when it is stopped at OOM time].

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Prevent OOM from killing init
On Fri, 23 Mar 2001, Paul Jakma wrote:
> On Fri, 23 Mar 2001, Szabolcs Szakacsits wrote:
> > About the "use resource limits!". Yes, this is one solution. The
> > *expensive* solution (admin time, worse resource utilization, etc).

Thanks for cutting out the relevant parts, which said how to increase the user base and satisfaction while keeping and using the existing possibility as well.

> traditional user limits have worse resource utilisation? think what
> kind of utilisation a guaranteed allocation system would have. instead
> of 128MB, you'd need maybe a GB of RAM and many many GB of swap for
> most systems.

Nonsense hodgepodge. See and/or measure the impact -- I sent numbers in my former email. You also missed that non-overcommit must be _optional_ [i.e. you wouldn't be forced to use it ;)]. Yes, there are users and enterprises who require it and would happily pay the 50-100% extra swap space for the same workload and the extra reliability.

> - setting up limits on a RH system takes 1 minute by editing
>   /etc/security/limits.conf.

And again every time you add/delete users, add/delete special apps, etc. Please note again: some people want it this way, some only sometimes, and others really don't care, because the system guarantees the admins that they will always have the resources to take action [unfortunately this is not Linux].

> - Rik's current oom killer may not do a good job now, but it's
>   impossible for it to do a /perfect/ job without implementing
>   kernel/esp.c.

Rik's killer is quite fine as a _default_. But there will always be people who don't like it [the bastards think humans can still make better decisions than machines]. Wouldn't it be a win for both sides if you could point out, "Hey, if you don't like the default, use the /proc/sys/vm/oom_killer interface"? As I said before, there is also such a patch, by Chris Swiedler, and it is definitely not a huge, complex one. And these stupid threads could be forgotten for good and all.

> - with limits set you will have:
>   - /possible/ underutilisation on some workloads.

Depends; guaranteed underutilisation or guaranteed extra unreliability fit the picture many times as well.

> no matter how good or bad Rik's killer is, i'd much rather set limits
> and just about /never/ have it invoked.

Thanks for expressing your opinion, but others [not necessarily me] "occasionally" hold a different one, depending on the job the box must do.

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
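For reference, the per-user limits discussed above (whether set in /etc/security/limits.conf or by hand) come down to setrlimit(2). A minimal sketch -- the cap and request sizes are arbitrary example values -- showing that once a hard address-space limit is in place, an oversized allocation fails cleanly with NULL instead of inviting the OOM killer:

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Cap this process's address space at `cap` bytes, then check whether
 * a `want`-byte allocation is refused. Returns 1 if malloc() failed
 * cleanly (the app gets NULL back and can react itself), 0 if the
 * allocation went through, -1 if the limit could not be set. */
static int allocation_refused(size_t cap, size_t want)
{
    struct rlimit rl = { cap, cap };    /* soft and hard limit */

    if (setrlimit(RLIMIT_AS, &rl) != 0)
        return -1;

    void *p = malloc(want);
    if (p != NULL) {
        free(p);
        return 0;
    }
    return 1;
}
```

Note that RLIMIT_AS caps total address space, not RSS, so it counts untouched overcommitted pages too; that is precisely what makes it a blunt but effective guard in an overcommitting kernel.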
Re: [PATCH] Prevent OOM from killing init
On Fri, 23 Mar 2001, Alan Cox wrote:
> > > and rely on it. You might find you need a few Gbytes of swap just
> > > to boot
> > Seems a bit exaggeration ;) Here are numbers,
> NetBSD is if I remember rightly still using a.out library styles.

No, it uses ELF today; moreover, the numbers were from Solaris. NetBSD also switched [AFAIK] from non-overcommit to overcommit-only mode, with "random" process killing, with its new UVM.

> > 6-50% more VM and the performance hit also isn't so bad as it's
> > thought (Eduardo Horvath sent a non-overcommit patch for Linux about
> > one year ago).
> The Linux performance hit would be so close to zero you shouldnt be
> able to measure it - or it was in 1.2 anyway

Yep, something like this :)

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: system call for process information?
On Wed, 14 Mar 2001, Alexander Viro wrote:
> On Wed, 14 Mar 2001, Szabolcs Szakacsits wrote:
> > read() doesn't really work for this purpose, it blocks way too many
> > times to be very annoying. When finally data arrives it's useless.
> Huh? Take code of your non-blocking syscall. Make it ->read() for
> relevant file on /proc or wherever else you want it. See read() not
> blocking...

Sorry, I should have quoted "blocks". The problem isn't blocking but getting *no* data, no information. In the end all you can conclude is that you know *nothing* about what happened in the last t time interval -- and this can be seconds or even minutes, even with an RT, mlocked, etc. process, when the load is around 0.

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: system call for process information?
On Mon, 12 Mar 2001, Alexander Viro wrote:
> On Mon, 12 Mar 2001, Guennadi Liakhovetski wrote:
> > I need to collect some info on processes. One way is to read /proc
> > tree. But isn't there a system call (ioctl) for this? And what are those
> Occam's Razor. Why invent new syscall when read() works?

read() doesn't really work for this purpose; it blocks far too often, which is very annoying, and when the data finally arrives it's useless.

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux Disk Performance/File IO per process
On Mon, 29 Jan 2001, Chris Evans wrote:
> Stephen Tweedie has a rather funky i/o stats enhancement patch which
> should provide what you need. It comes with RedHat7.0 and gives decent
> disk statistics in /proc/partitions.

Monitoring via /proc [not just IO but close to anything] has these properties:

- slow, not atomic, not scalable
- if the kernel decides, explicitly or due to a "bug", to refuse doing IO, you get something like this [even using an mlocked, RT monitor]:

 procs                  memory    swap          io     system         cpu
 r  b  w   swpd  free  buff  cache   si    so    bi    bo   in    cs  us sy id
 0  1  1  27116  1048   736 152832  128  1972  2544   869   44  1812   2 43 55
 5  0  2  27768  1048   744 153372   52  1308  2668   777   43  1772   2 61 37
 0  2  1  28360  1048   752 153900  332   564  2311   955   49  2081   1 68 31
 [frozen]
 1  7  2  28356  1048   752 153708 3936     0  2175 29091  494 27348   0  1 99
 1  0  2  28356  1048   792 153656  172     0  7166     0  144   838   4 17 80

In short, monitoring via /proc is unreliable.

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.1pre8 slowdown on dbench tests
On Fri, 19 Jan 2001, Jens Axboe wrote:
> On Fri, Jan 19 2001, Szabolcs Szakacsits wrote:
> > Redone with big enough swap by requests.
> > 2.4.0, 132MB swap
> > 548.81user  128.97system 11:22 99%CPU (442433major+705419minor)
> > 561.12user  171.06system 12:29 97%CPU (446949major+712525minor)
> > 625.68user 2833.29system 1:12:38 79%CPU (638957major+1463974minor)
> > ===
> > 2.4.1pre8, 132MB swap
> > 548.71user  117.93system 11:09 99%CPU (442434major+705420minor)
> > 558.93user  166.82system 12:20 98%CPU (446941major+712662minor)
> > 621.37user 2592.54system 1:07:33 79%CPU (592679major+1311442minor)
> Better, could you try with the number changes that Andrea suggested
> too? Thanks.

Helped intensive swapping a bit, degraded the other cases [no or slight swapping].

2.4.1pre8, 32MB RAM, 132MB swap, blk suggestion
544.19user  141.25system 11:31 99%CPU (442419major+705411minor)
554.83user  191.57system 12:41 98%CPU (445762major+710409minor)
612.05user 2551.37system 1:07:21 78%CPU (589623major+1313665minor)

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.1pre8 slowdown on dbench tests
Redone with big enough swap, as requested.

2.4.0, 132MB swap
548.81user  128.97system 11:22 99%CPU (442433major+705419minor)
561.12user  171.06system 12:29 97%CPU (446949major+712525minor)
625.68user 2833.29system 1:12:38 79%CPU (638957major+1463974minor)
===
2.4.1pre8, 132MB swap
548.71user  117.93system 11:09 99%CPU (442434major+705420minor)
558.93user  166.82system 12:20 98%CPU (446941major+712662minor)
621.37user 2592.54system 1:07:33 79%CPU (592679major+1311442minor)

> Below some kernel compile numbers on a 32 MB RAM + 32 MB swap box. The
> three lines mean compilation with the -j1, -j2 and -j4 option. Most of
> the time 2.4.1pre8 was also unable to compile the kernel because cc1
> was killed by OOM handler.
>
> 2.2.18
> 548.27user   94.18system 10:50 98%CPU (450479major+696869minor)
> 548.94user  153.85system 11:51 98%CPU (487111major+704948minor)
> 599.44user 2018.66system 51:47 84%CPU (2295045major+1182819minor)
> =
> 2.4.0
> 557.18user  121.57system 11:25 99%CPU (442434major+705429minor)
> 551.76user  158.78system 12:11 97%CPU (446183major+711572minor)
> 579.65user 2860.53system 1:05:45 87%CPU (650964major+1209969minor)
> ===
> 2.4.0+blk-13B
> 546.89user  140.35system 11:33 99%CPU (442435major+705424minor)
> 570.73user  188.51system 12:56 97%CPU (445171major+712791minor)
> 566.33user 2681.20system 1:02:26 86%CPU (654402major+1225784minor)
> =
> 2.4.1pre8
> 546.23user  118.81system 11:09 99%CPU (442434major+705424minor)
> 569.12user  161.25system 12:22 98%CPU (446667major+712457minor)
> 727.58user 2489.96system 1:25:34 62%CPU (616240major+1375321minor)

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.1pre8 slowdown on dbench tests
On Thu, 18 Jan 2001, Marcelo Tosatti wrote:
> On my dbench runs I've noted a slowdown between pre4 and pre8 with 48
> threads. (128MB, 2 CPU's machine)

Below are some kernel compile numbers on a 32 MB RAM + 32 MB swap box. The three lines correspond to compilation with the -j1, -j2 and -j4 options. Most of the time 2.4.1pre8 was also unable to compile the kernel, because cc1 was killed by the OOM handler.

Szaka

2.2.18
548.27user   94.18system 10:50 98%CPU (450479major+696869minor)
548.94user  153.85system 11:51 98%CPU (487111major+704948minor)
599.44user 2018.66system 51:47 84%CPU (2295045major+1182819minor)
=
2.4.0
557.18user  121.57system 11:25 99%CPU (442434major+705429minor)
551.76user  158.78system 12:11 97%CPU (446183major+711572minor)
579.65user 2860.53system 1:05:45 87%CPU (650964major+1209969minor)
===
2.4.0+blk-13B
546.89user  140.35system 11:33 99%CPU (442435major+705424minor)
570.73user  188.51system 12:56 97%CPU (445171major+712791minor)
566.33user 2681.20system 1:02:26 86%CPU (654402major+1225784minor)
=
2.4.1pre8
546.23user  118.81system 11:09 99%CPU (442434major+705424minor)
569.12user  161.25system 12:22 98%CPU (446667major+712457minor)
727.58user 2489.96system 1:25:34 62%CPU (616240major+1375321minor)

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Subtle MM bug (really 830MB barrier question)
On Tue, 9 Jan 2001, Dan Maas wrote:
> OK it's fairly obvious what's happening here. Your program is using
> its own allocator, which relies solely on brk() to obtain more
> memory.
[... good explanation here ...]
> Here's your short answer: ask the authors of your program to either
> 1) replace their custom allocator with regular malloc() or 2) enhance
> their custom allocator to use mmap. (or, buy some 64-bit hardware =)...)

3) ask the kernel developers to get rid of this "brk hits the fixed start address of mmapped areas" limitation, or of its mirror-image complaint, "the mmapped area should start at a lower address". E.g. Solaris uses a heap growing up, mmaps growing down and a fixed-size stack at the top.

Wayne, the patch below should fix your barrier problem [for a 1 GB physical memory configuration]; I have used it only with 2.2 kernels. Your app should complain about being out of memory around 2.7 GB (0xb000-0x08??), but note that only 256 MB (0xc000-0xb000) are left for shared libraries and mmapped areas.

Good luck,
Szaka

--- linux-2.2.18/include/asm-i386/processor.h	Thu Dec 14 08:20:17 2000
+++ linux/include/asm-i386/processor.h	Tue Jan  9 17:50:49 2001
@@ -166,7 +166,7 @@
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE	(TASK_SIZE / 3)
+#define TASK_UNMAPPED_BASE	0xb000
 
 /*
  * Size of io_bitmap in longwords: 32 is ports 0-0x3ff.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Subtle MM bug
Andi Kleen <[EMAIL PROTECTED]> wrote:
> On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> > package called MAGMA; at times this requires very large matrices. The
> > RSS can get up to 870MB; for some reason a MAGMA process under linux
> > thinks it has run out of memory at 870MB, regardless of the actual
> > memory/swap in the machine. MAGMA is single-threaded.
> I think it's caused by the way malloc maps its memory.
> Newer glibc should work a bit better by falling back to mmap even
> for smaller allocations (older does it only for very big ones)

AFAIK "newer glibc" means CVS glibc, but the malloc() tuning parameters work via environment variables in the current stable releases as well; e.g. to overcome the above "out of memory" one could do:

% export MALLOC_MMAP_MAX_=100
% export MALLOC_MMAP_THRESHOLD_=0
% magma

By default, on 32-bit Linux, the current stable glibc malloc uses brk between 0x08??-0x4000 plus at most 128 mmaps (MALLOC_MMAP_MAX_), an mmap being used when the requested chunk is greater than 128 kB (MALLOC_MMAP_THRESHOLD_). If MAGMA mallocs memory in chunks of less than 128 kB, then the above out-of-memory behaviour is expected.

Szaka

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
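The same tuning can also be done from inside the process with glibc's mallopt(3), the programmatic counterpart of the MALLOC_MMAP_MAX_ / MALLOC_MMAP_THRESHOLD_ environment variables. A glibc-specific sketch (the 100-mmap cap and the 64 kB test allocation are arbitrary values mirroring the example above):

```c
#include <malloc.h>
#include <stdlib.h>

/* In-process equivalent of MALLOC_MMAP_THRESHOLD_=0 and
 * MALLOC_MMAP_MAX_=100: push even small chunks out of the brk heap
 * and into mmap. Returns 0 on success, -1 on failure.
 * mallopt() returns nonzero on success, 0 on error. */
static int force_mmap_allocations(void)
{
    if (mallopt(M_MMAP_THRESHOLD, 0) == 0)  /* mmap chunks of any size */
        return -1;
    if (mallopt(M_MMAP_MAX, 100) == 0)      /* allow up to 100 such mmaps */
        return -1;

    void *p = malloc(64 * 1024);            /* now served by mmap, not brk */
    if (p == NULL)
        return -1;
    free(p);
    return 0;
}
```

The environment-variable route has the advantage of needing no source changes, which is why it is the right fix for a closed binary like MAGMA; mallopt() is for code you can rebuild.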
[PATCH-2] Re: NR_RESERVED_FILES broken in 2.4 too
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> On Sun, 10 Dec 2000, Szabolcs Szakacsits wrote:
> > - this comment from include/linux/fs.h should be deleted
> >   #define NR_RESERVED_FILES 10 /* reserved for root */
> well, not really -- it is "reserved" right now too, it is just root is
> allowed to use up all the reserved entries in the beginning and then when
> the normal user uses up all the "non-reserved" ones (from slab
> cache) there would be nothing left for the root.

And what real functionality does this provide? Close to nada. This is why I told you that if you are right, then it's useless. So I think this is a bug, introduced accidentally by overlooking the NR_RESERVED_FILES functionality when get_empty_filp was rewritten to use the slab.

> But let us not argue about the above definition of "reserved" -- that is
> not productive.

Agreed; this is why I made the patch ;)

Also, this stupid misunderstanding and waste of time between us is a *very* typical example of the result of the vastly inferior Linux kernel source code management. There is no way to dig up who dropped the reserved-file functionality about three years ago, or why. "Hidden", unexplained patches slip into almost every patch-set. Some developers think they can save a huge amount of time with this "model"; they just ignore the other developers and support people who need to understand what changed, when, why and by whom. And because of this lack of information [look, both of us have and I think understand the code, and still we don't agree], the end result is that, apparently, too many times by now the ball is dropped back to these same developers, who get buried under even more work. This is just one sign that Linux has a hard future ahead, and unfortunately there are others. In general Linux is still one of the best today, but without addressing and solving its current development problems this will no longer be true in a couple of years; Linux would remain just another Unix and lose 1:100 to another OS.
The source is with us, but it should be used properly.

> Let's do something productive -- namely, take your idea to
> the next logical step. Since you have proven that the freelist mechanism
> or concept of "reserve file structures" is not 100% satisfactory as is

This is also a difference between us. You look at the problem from a theoretical point of view and say it's not 100%; I consider it from a practical point of view and say it gives close to 0% functionality for users.

> then how about removing the freelist altogether? I.e. what about serving

I'm fine with the current implementation and more interested in bug fixes. There could be one reason against the patch: performance. The patch below has the same fix, and TUX will give exactly the same numbers [the get_empty_filp code remains ugly but at least fast].

Szaka

diff -ur linux-2.4.0-test12-pre7/fs/file_table.c linux/fs/file_table.c
--- linux-2.4.0-test12-pre7/fs/file_table.c	Fri Dec  8 08:17:12 2000
+++ linux/fs/file_table.c	Mon Dec 11 10:40:41 2000
@@ -57,7 +57,9 @@
 	/*
 	 * Allocate a new one if we're below the limit.
 	 */
-	if (files_stat.nr_files < files_stat.max_files) {
+	if ((files_stat.nr_files < files_stat.max_files) && (!current->euid ||
+	     NR_RESERVED_FILES - files_stat.nr_free_files <
+	     files_stat.max_files - files_stat.nr_files)) {
 		file_list_unlock();
 		f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
 		file_list_lock();
diff -ur linux-2.4.0-test12-pre7/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.4.0-test12-pre7/include/linux/fs.h	Fri Dec  8 15:06:55 2000
+++ linux/include/linux/fs.h	Sun Dec 10 17:37:52 2000
@@ -57,7 +57,7 @@
 extern int leases_enable, dir_notify_enable, lease_break_time;
 
 #define NR_FILE  8192	/* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10	/* reserved for root */
+#define NR_RESERVED_FILES 128	/* reserved for root */
 #define NR_SUPER 256
 
 #define MAY_EXEC 1

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[PATCH-2] Re: NR_RESERVED_FILES broken in 2.4 too
On Sun, 10 Dec 2000, Tigran Aivazian wrote:

> On Sun, 10 Dec 2000, Szabolcs Szakacsits wrote:
> > - this comment from include/linux/fs.h should be deleted
> >   #define NR_RESERVED_FILES 10 /* reserved for root */
> well, not really -- it is "reserved" right now too, it is just root is
> allowed to use up all the reserved entries in the beginning and then when
> the normal user uses up all the "non-reserved" ones (from slab cache)
> there would be nothing left for the root.

And what real functionality does this provide? Close to nada. This is why
I told you that if you are right then it's useless. So I think this is a
bug that was introduced accidentally, overlooking the NR_RESERVED_FILES
functionality, when get_empty_filp was rewritten to use the slab.

> But let us not argue about the above definition of "reserved" -- that is
> not productive.

Agree, this is why I made the patch ;)

Also, this stupid misunderstanding and waste of time between us is a
*very* typical example of the result of the super inferior Linux kernel
source code management. No way to dig up who dropped the reserved file
functionality about three years ago, and why. "Hidden", unexplained
patches slip in with almost every patch-set. Some developers think they
can save a huge amount of time by this "model"; they just ignore other
developers and the support people who need to understand what, when, why
and by whom a change happened. And because of the lack of enough
information [look, both of us have and I think understand the code, still
we don't agree] the end result is that, apparently, by now too many times
the ball is dropped back to these developers, who get buried by even more
work. This is just one sign that Linux has a hard future, and
unfortunately there are others. In general Linux is still one of the best
today, but without addressing and solving the current development
problems this will not be true after a couple of years. Linux would
remain just another Unix and lose 1:100 to another OS.

The source is with us but it should be used properly.

> Let's do something productive -- namely, take your idea to the next
> logical step. Since you have proven that the freelist mechanism or
> concept of "reserve file structures" is not 100% satisfactory as is

This is also a difference between us. You look at the problem from a
theoretical point of view, saying it's not 100%; I consider it from a
practical point of view and say it gives close to 0% functionality for
users.

> then how about removing the freelist altogether? I.e. what about serving

I'm fine with the current implementation and more interested in bug
fixes. There could be one reason against the patch: performance. The
patch below has the same fix and TUX will give exactly the same numbers
[the get_empty_filp code remains ugly but at least fast].

Szaka

diff -ur linux-2.4.0-test12-pre7/fs/file_table.c linux/fs/file_table.c
--- linux-2.4.0-test12-pre7/fs/file_table.c	Fri Dec  8 08:17:12 2000
+++ linux/fs/file_table.c	Mon Dec 11 10:40:41 2000
@@ -57,7 +57,9 @@
 	/*
 	 * Allocate a new one if we're below the limit.
 	 */
-	if (files_stat.nr_files < files_stat.max_files) {
+	if ((files_stat.nr_files < files_stat.max_files) && (!current->euid ||
+	    NR_RESERVED_FILES - files_stat.nr_free_files <
+	    files_stat.max_files - files_stat.nr_files)) {
 		file_list_unlock();
 		f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
 		file_list_lock();
diff -ur linux-2.4.0-test12-pre7/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.4.0-test12-pre7/include/linux/fs.h	Fri Dec  8 15:06:55 2000
+++ linux/include/linux/fs.h	Sun Dec 10 17:37:52 2000
@@ -57,7 +57,7 @@
 extern int leases_enable, dir_notify_enable, lease_break_time;
 
 #define NR_FILE  8192	/* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10 /* reserved for root */
+#define NR_RESERVED_FILES 128 /* reserved for root */
 #define NR_SUPER 256
 
 #define MAY_EXEC 1
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] NR_RESERVED_FILES broken in 2.4 too
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> If, however, you believe that the above _is_ the case but it should _not_
> happen then you are proposing a completely new policy of file structure
> allocation which you believe is superior. It is quite possible so let's
> all understand your new policy and let Linus decide whether it's better
> than the existing one. But if so, don't tell me you are fixing a bug
> because it is not a bug -- it's a redesign of file structure allocator.

If it's not a bug then
- this comment from include/linux/fs.h should be deleted
  #define NR_RESERVED_FILES 10 /* reserved for root */
- books should be updated
- people's minds also, who believe the kernel reserves fd's for the superuser

The kernel since 2.1 plays the lottery in this regard. And this would be
another sad fact: the kernel is extremely poor *out of the box* in
regards to security and reliability ...

Szaka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] NR_RESERVED_FILES broken in 2.4 too
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> problem (e.g. you mentioned something about allocating more than NR_FILES
> on SMP -- what do you mean?) which you are not explaining clearly.

E.g. this situation: only one file struct is left for allocation. One CPU
goes into get_empty_filp and, before kmem_cache_alloc, unlocks file_list;
another CPU also gets into get_empty_filp, takes file_list at the top and
goes down the same path. The end result potentially can be that both
increase nr_files instead of only one. But I don't think it's a big issue
at *present* that could cause any problems ...

> You just say "it is broken and here is the patch" but that, imho, is not
> enough. (ok, one could overcome the laziness and actually _read_ your
> patch to see what you _think_ is broken but surely it is better if you
> explain it yourself?).

Sorry I didn't explain; I thought it's short enough and significantly
faster to understand by reading the code than my poor English ;)

Szaka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
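[The race described in the message above is the classic check-then-act
window: the limit test happens under file_list_lock, but nr_files is
incremented only after the lock has been dropped and re-taken around the
slow allocation. A minimal stand-alone sketch of the remedy the later
patch uses -- reserve the slot (increment the counter) before dropping
the lock -- with illustrative names, not actual kernel code:]

```c
#include <pthread.h>
#include <stdlib.h>

#define MAX_FILES 100

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_files = 0;

/* Reserve the slot under the lock *before* the (slow) allocation, so two
 * racing threads cannot both pass the limit check on the last free slot. */
static void *try_alloc(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    if (nr_files < MAX_FILES) {
        nr_files++;                 /* slot reserved while still locked */
        pthread_mutex_unlock(&lock);
        void *f = malloc(64);       /* stands in for kmem_cache_alloc() */
        pthread_mutex_lock(&lock);
        if (!f)
            nr_files--;             /* give the slot back on failure */
        else
            free(f);                /* demo: only the counter matters here */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Run nthreads concurrent allocators and report the final counter. */
int count_after_racing_allocs(int nthreads)
{
    pthread_t t[nthreads];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, try_alloc, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return nr_files;
}
```

[With the increment inside the critical section, 150 racing threads leave
the counter at exactly MAX_FILES; with the original increment-after-unlock
ordering it could overshoot.]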
Re: [PATCH] NR_RESERVED_FILES broken in 2.4 too
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> > user% ./fd-exhaustion  # e.g. while(1) open("/dev/null",...);
> > root# cat /proc/sys/fs/file-nr
> > cat: /proc/sys/fs/file-nr: Too many open files in system
> >
> > The above happens even with NR_RESERVED_FILES increased to 96 [no
> > wonder, get_empty_filp is broken].
>
> no, it is not broken. But your experiment is broken. Don't do cat file-nr
> but compile this C program

Ok, now I understand why you can't see the problem ;) You look up the
values in user space but I did it [additionally] in kernel space [also I
think I understand what happens ;)]. I guess with the code below you
claim I shouldn't see values like this when file struct allocations are
started by user apps,

	1024 0 1024

or that the 0 shouldn't be between 0 and NR_RESERVED_FILES. Right? Wrong.
I saw it happen; you can reproduce it if you look up the nr_free_files
value, allocate that many as root, don't release them and immediately
after this start to allocate fd's from a user app. Note, if you have
already hit nr_files = max_files you won't ever be able to reproduce the
above - but that is a half solution; kernel 2.0 was fine, get_empty_filp
was broken somewhere between 2.0 and 2.1 and it's still broken. With the
patch the functionality is back and it also works the way the authors of
the book mentioned believe ;) It's quite funny, because before this I was
also told it was broken but I couldn't believe it, so I looked at the
code and tested it; the report was right ... Still disagree? ;)

Szaka

> #include <sys/types.h>
> #include <sys/stat.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
> 	int fd, len;
> 	static char buf[2048];
>
> 	fd = open("/proc/sys/fs/file-nr", O_RDONLY);
> 	if (fd == -1) {
> 		perror("open");
> 		exit(1);
> 	}
> 	while (1) {
> 		len = read(fd, buf, 1024);
> 		printf("len=%d %s", len, buf);
> 		lseek(fd, 0, SEEK_SET);
> 		sleep(1);
> 	}
> 	return 0;
> }
>
> and leave it running while doing experiments on the other console. You
> will see that everything is fine -- there is no bug. No wonder you saw the
> bug -- you ignored my 4 emails telling you otherwise :)
>
> Regards,
> Tigran

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: 2.2.18pre25: VM: do_try_to_free_pages failed for
thunder7 wrote:
> for almost everything:
> Dec 10 13:33:47 middle kernel: VM: do_try_to_free_pages failed for kswapd...
> [...] tried to log in over the network, didn't work, pressed C-A-D and
> watched fsck mull over 60+ Gb

You could try out my patch that "reserves" virtual memory for root, so
you should be able to login/ssh and clean up if your "faulty" or memory
hungry daemons aren't run by root -- it works fine for me and I didn't
get negative feedback so far:
http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_vm+oom_killer-5.diff
More on the patch:
http://boudicca.tux.org/hypermail/linux-kernel/2000week48/0624.html

> Most messages I was able to dig up about this mentioned 2.2.17 and
> suggested upgrading to 2.2.18pre. I didn't think there is anything
> changed between 2.2.18pre25 and 2.2.18pre26 (2.2.18 to be) in VM
> handling, so the problem still seems to persist. What are the
> suggestions? Moving to 2.4 is not possible, since the isdn
> compression module isdn_lzscomp.o won't work in 2.4.

Andrea Arcangeli's VM global patch got good feedback and according to
Alan Cox it's a potential candidate for 2.2.19:
ftp://ftp.nl.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre18/VM-global-2.2.18pre18-7.bz2

Good luck,
Szaka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] NR_RESERVED_FILES broken in 2.4 too
On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> > On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > > Read the whole get_empty_filp function, especially this part,
> > I have read the whole function, including the above code, of course. The
> > new_one label has nothing to do with freelists -- it adds the file to the
> > anon_list, where the new arrivals from the slab cache go. The goto
> > new_one above is there simply to initialize the structure with sane
> > initial values
> OK, 2.2 has put_inuse(f); instead of putting it on the anon_list, so 2.4
> seems ok.

Back to common sense ;) Nevertheless, in addition to what you wrote,
get_empty_filp returns an allocated file struct that gets to be used. So,
ignoring your four emails arguing the kernel is ok, I downloaded
2.4-test11-pre7 and tried it out.

	root# echo 1024 > /proc/sys/fs/file-max

Unpatched kernel,

	user% ./fd-exhaustion  # e.g. while(1) open("/dev/null",...);
	root# cat /proc/sys/fs/file-nr
	cat: /proc/sys/fs/file-nr: Too many open files in system

The above happens even with NR_RESERVED_FILES increased to 96 [no
wonder, get_empty_filp is broken]. With the patch below,

	user% ./fd-exhaustion
	root# cat /proc/sys/fs/file-nr
	946 0 1024
	or
	1024 78 1024

or something else that also works. The patch also has a fix not to
allocate potentially more file structs than NR_FILES on SMP.
Unfortunately NR_RESERVED_FILES needs to be increased to be useful [i.e.
e.g. to make ssh|login+ps|kill work for the superuser]. Another way would
be to more aggressively free unused file structs if the kernel is short
on free fd's.

> > There are even books (Understanding the Linux
> > Kernel by Bovet et al) which describe this freelist in the
> > current context so your patch will require updates to the books.

Checked this part of the book; ok for 2.0 but not for 2.[24].
Szaka

diff -ur linux-2.4.0-test12-pre7/fs/file_table.c linux/fs/file_table.c
--- linux-2.4.0-test12-pre7/fs/file_table.c	Fri Dec  8 08:17:12 2000
+++ linux/fs/file_table.c	Sun Dec 10 17:05:55 2000
@@ -32,39 +32,36 @@
 {
 	static int old_max = 0;
 	struct file * f;
+	int total_free;
 
 	file_list_lock();
-	if (files_stat.nr_free_files > NR_RESERVED_FILES) {
-	used_one:
-		f = list_entry(free_list.next, struct file, f_list);
-		list_del(&f->f_list);
-		files_stat.nr_free_files--;
-	new_one:
-		memset(f, 0, sizeof(*f));
-		atomic_set(&f->f_count,1);
-		f->f_version = ++event;
-		f->f_uid = current->fsuid;
-		f->f_gid = current->fsgid;
-		list_add(&f->f_list, &anon_list);
-		file_list_unlock();
-		return f;
-	}
-	/*
-	 * Use a reserved one if we're the superuser
-	 */
-	if (files_stat.nr_free_files && !current->euid)
-		goto used_one;
-	/*
-	 * Allocate a new one if we're below the limit.
-	 */
-	if (files_stat.nr_files < files_stat.max_files) {
+	total_free = files_stat.max_files - files_stat.nr_files +
+		     files_stat.nr_free_files;
+	if (total_free > NR_RESERVED_FILES || (total_free && !current->euid)) {
+		if (files_stat.nr_free_files) {
+			/* used_one */
+			f = list_entry(free_list.next, struct file, f_list);
+			list_del(&f->f_list);
+			files_stat.nr_free_files--;
+		new_one:
+			memset(f, 0, sizeof(*f));
+			atomic_set(&f->f_count,1);
+			f->f_version = ++event;
+			f->f_uid = current->fsuid;
+			f->f_gid = current->fsgid;
+			list_add(&f->f_list, &anon_list);
+			file_list_unlock();
+			return f;
+		}
+		/*
+		 * Allocate a new one if we're below the limit.
+		 */
+		files_stat.nr_files++;
 		file_list_unlock();
 		f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
 		file_list_lock();
-		if (f) {
-			files_stat.nr_files++;
+		if (f)
 			goto new_one;
-		}
+		files_stat.nr_files--;
 		/* Big problems... */
 		printk("VFS: filp allocation failed\n");
diff -ur linux-2.4.0-test12-pre7/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.4.0-test12-pre7/include/linux/fs.h	Fri Dec  8 15:06:55 2000
+++ linux/include/linux/fs.h	Sun Dec 10 17:37:52 2000
@@ -57,7 +57,7 @@
 extern int leases_enable, dir_notify_enable, lease_break_time;
 
 #define NR_FILE  8192	/* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10 /* reserved for root */
+#define NR_RESERVED_FILES 128 /* reserved for root */
 #define NR_SUPER 256
 
 #define MAY_EXEC 1
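[The ./fd-exhaustion program referred to in these messages is not
included in the mails; a minimal stand-in that behaves as described --
open /dev/null in a loop until the kernel refuses -- might look like the
sketch below. The setrlimit call and the exhaust_fds helper are
assumptions added so the demo exhausts only its own per-process fd table
and cleans up afterwards, rather than the system-wide file table:]

```c
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Open "/dev/null" repeatedly until open() fails, as the fd-exhaustion
 * test in the mails does, then close everything and restore the old
 * limit.  Returns how many descriptors were obtained before the kernel
 * said no. */
int exhaust_fds(int limit)
{
    struct rlimit old, rl;
    int fds[1024];
    int n = 0;

    if (limit > 1024 || getrlimit(RLIMIT_NOFILE, &old) != 0)
        return -1;
    rl.rlim_cur = limit;            /* lower only the soft limit */
    rl.rlim_max = old.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    for (;;) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd == -1)
            break;                  /* EMFILE: per-process table is full */
        fds[n++] = fd;
    }

    int count = n;
    while (n > 0)                   /* release everything again */
        close(fds[--n]);
    setrlimit(RLIMIT_NOFILE, &old); /* restore the original limit */
    return count;
}
```

[Run against a limit of 64, the helper gets a little under 64 descriptors
(stdin/stdout/stderr and any inherited fds count against the limit).]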
Re: [PATCH] Broken NR_RESERVED_FILES
On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > Read the whole get_empty_filp function, especially this part, note the
> > goto new_one below and the part you didn't include above [from
> > the new_one label],
> >
> > 	if (files_stat.nr_files < files_stat.max_files) {
> > 		file_list_unlock();
> > 		f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
> > 		file_list_lock();
> > 		if (f) {
> > 			files_stat.nr_files++;
> > 			goto new_one;
> > 		}
>
> I have read the whole function, including the above code, of course. The
> new_one label has nothing to do with freelists -- it adds the file to the
> anon_list, where the new arrivals from the slab cache go. The goto
> new_one above is there simply to initialize the structure with sane
> initial values

OK, 2.2 has put_inuse(f); instead of putting it on the anon_list, so 2.4
seems ok.

Szaka

> So, the normal user _cannot_ take a file structure from the freelist
> unless it contains more than NR_RESERVED_FILES entries. Please read the
> whole function and see it for yourself.
>
> Regards,
> Tigran

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Broken NR_RESERVED_FILES
On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > again. The failed logic is also clear from the kernel code [user
> > happily allocates when freelist < NR_RESERVED_FILES].
>
> is it clear to you? it is not clear to me, or rather the opposite seems
> clear. This is what the code looks like (in 2.4):
>
> struct file * get_empty_filp(void)
> {
> 	static int old_max = 0;
> 	struct file * f;
>
> 	file_list_lock();
> 	if (files_stat.nr_free_files > NR_RESERVED_FILES) {
> 	used_one:
> 		f = list_entry(free_list.next, struct file, f_list);
> 		list_del(&f->f_list);
> 		files_stat.nr_free_files--;
>
> so, a normal user is only allowed to allocate from the freelist when the
> number of elements on the freelist is > NR_RESERVED_FILES. I do not see
> how you are able to take elements from the freelist when the number is <
> NR_RESERVED_FILES unless you are a super-user, i.e. current->euid == 0.

Read the whole get_empty_filp function, especially this part; note the
goto new_one below and the part you didn't include above [from the
new_one label],

	if (files_stat.nr_files < files_stat.max_files) {
		file_list_unlock();
		f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
		file_list_lock();
		if (f) {
			files_stat.nr_files++;
			goto new_one;
		}

> Btw, while you are there (in 2.2 kernel) you may want to fix the

Sorry, no time.

Szaka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Broken NR_RESERVED_FILES
On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > Reserved fd's for superuser doesn't work.
> It does actually work,

What do you mean by "work"? I meant user apps are able to exhaust fd's
completely and none are left for the superuser.

> but remember that the concept of "reserved file
> structures for superuser" is defined as "file structures to be taken from
> the freelist"

Yes, in this sense it works, and it's also very close to useless.

> whereas your patch below: [...]
> allows one to allocate a file structure from the filp_cache slab cache if
> one is a superuser.

Or one is a user who hasn't yet hit the reserved fd's (and of course the
superuser isn't able to allocate more than max_files).

Szaka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Broken NR_RESERVED_FILES
Reserved fd's for the superuser don't work. A patch for 2.2 is below;
kernel 2.4.x also has this problem, the fix is similar. The default
NR_RESERVED_FILES value also had to be increased (e.g. ssh and login need
36, ls 16, man 45 fd's, etc).

BTW, I have an updated version of my reserved VM for superuser +
improved/fixed version of Rik's out of memory killer patch for 2.2 here,
http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_vm+oom_killer-5.diff
It fixes the potential deadlock when kernel threads were blocked trying
to free pages - more details about the patch are in a former email,
http://boudicca.tux.org/hypermail/linux-kernel/2000week48/0624.html

Szaka

diff -ur linux-2.2.18pre21/fs/file_table.c linux/fs/file_table.c
--- linux-2.2.18pre21/fs/file_table.c	Tue Jan  4 13:12:23 2000
+++ linux/fs/file_table.c	Thu Dec  7 13:26:06 2000
@@ -71,30 +71,27 @@
 {
 	static int old_max = 0;
 	struct file * f;
+	int total_free;
 
-	if (nr_free_files > NR_RESERVED_FILES) {
-	used_one:
-		f = free_filps;
-		remove_filp(f);
-		nr_free_files--;
-	new_one:
-		memset(f, 0, sizeof(*f));
-		f->f_count = 1;
-		f->f_version = ++global_event;
-		f->f_uid = current->fsuid;
-		f->f_gid = current->fsgid;
-		put_inuse(f);
-		return f;
-	}
-	/*
-	 * Use a reserved one if we're the superuser
-	 */
-	if (nr_free_files && !current->euid)
-		goto used_one;
-	/*
-	 * Allocate a new one if we're below the limit.
-	 */
-	if (nr_files < max_files) {
+	total_free = max_files - nr_files + nr_free_files;
+	if (total_free > NR_RESERVED_FILES || (total_free && !current->euid)) {
+		if (nr_free_files) {
+		used_one:
+			f = free_filps;
+			remove_filp(f);
+			nr_free_files--;
+		new_one:
+			memset(f, 0, sizeof(*f));
+			f->f_count = 1;
+			f->f_version = ++global_event;
+			f->f_uid = current->fsuid;
+			f->f_gid = current->fsgid;
+			put_inuse(f);
+			return f;
+		}
+		/*
+		 * Allocate a new one if we're below the limit.
+		 */
 		f = kmem_cache_alloc(filp_cache, SLAB_KERNEL);
 		if (f) {
 			nr_files++;
diff -ur linux-2.2.18pre21/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.2.18pre21/include/linux/fs.h	Thu Nov  9 08:20:18 2000
+++ linux/include/linux/fs.h	Thu Dec  7 11:10:50 2000
@@ -51,7 +51,7 @@
 extern int max_super_blocks, nr_super_blocks;
 
 #define NR_FILE  4096	/* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10 /* reserved for root */
+#define NR_RESERVED_FILES 96 /* reserved for root */
 #define NR_SUPER 256
 
 #define MAY_EXEC 1
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
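[The admission policy in the patch above is compact enough to model in
isolation: an allocation is allowed while more than NR_RESERVED_FILES
slots remain in total (free list plus unallocated headroom), or while any
slot at all remains and the caller is the superuser. A self-contained
sketch of just that predicate -- illustrative names, not the kernel's
actual types:]

```c
#define NR_RESERVED_FILES 96

/* Mirror of the patch's admission test:
 *   total_free = max_files - nr_files + nr_free_files;
 *   allow if total_free > NR_RESERVED_FILES,
 *   or if total_free > 0 and the caller is the superuser (euid == 0). */
int may_alloc_filp(int max_files, int nr_files, int nr_free_files, int euid)
{
    int total_free = max_files - nr_files + nr_free_files;
    return total_free > NR_RESERVED_FILES || (total_free && euid == 0);
}
```

[So with max_files = 4096, an unprivileged caller is admitted down to 97
remaining slots and refused at 96, while root is admitted until the very
last slot is gone -- which is exactly the reservation the original code
failed to provide.]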
Re: [PATCH] Reserved root VM + OOM killer
On Thu, 23 Nov 2000, Pavel Machek wrote:
> > HOW?
> > No performance loss, RAM is always fully utilized (except if no swap),
>
> Handheld machines never have any swap, and always have little RAM [trust
> me, the velo1 I'm writing this on is so tuned that 100KB less and the
> machine is useless]. Unless reservation can be turned off, it is not
> acceptable. Okay, it can be tuned.

Ok, then.

> > [What about making default reserved space 10% of *swap* size?]
>
> No. Many people use no swap even if they have plenty of RAM.

I wasn't right when I wrote the "reserved" VM is on swap or in
buffer/page cache. I wanted to write that the reserved VM is unused swap
and/or it is *used* as buffer/page cache until it is needed by root.
Leave swap out of the former sentence and you get that no RAM is wasted
at all ;) Moreover the default value for boxes with less than 8MB is 0
pages (I thought about "embedded" systems), it's 5 MB if the box has more
than 100MB, and 5% of the RAM -- after considering it as part of the VM
-- between 8MB and 100MB. I found in my setup at least 4 MB is needed to
be useful if root wants to act for sure. Of course this can be different
in other setups and with other application behaviours -- this is why it
can be tuned at runtime. Using more "reserved" [this is really a stupid
and not accurate name] VM definitely helps :) BTW, apparently Solaris
reserves 4 MB for root.

I also thought about making it a compile time option [for people using
Linux in embedded systems]; in that case you would have less than a 25%
chance to save one page -- I would instead optimize the compiler ;) but
maybe embedded systems use non-overcommittable memory handling, I didn't
look at how they handle OOM.

I'm afraid I was also wrong about performance. Here is a typical case of
how the standard 2.2 kernel works if OOM happens: killing gpm, vmstat,
syslogd, tail, httpd, zsh, identd, httpd, klogd, httpd, httpd, httpd [the
main httpd, web is dead], bad_app. If there are more bad_apps [working on
the same problem but e.g. they were fed wrong input, etc], then there is
a big chance you must hit the reset button.

With Rik's OOM killer, the "right" processes are killed, but I found the
system thrashes too long and because of the constant memory pressure you
still must hit the reset button. With my patch + fixes of Rik's OOM
killer, the "right" processes are killed fast [it's done only in the page
fault path, contrary to 2.4.0-test11 which has two OOM killers: one in
the page fault path and Rik's one ... pretty ugly] and you can do
whatever you want as root.

It would be nice to see which one of the three cases would finish a job
first where multiple processes [not threads] work on the same job, saving
the partial results and constantly producing OOM.

Szaka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Reserved root VM + OOM killer
On Wed, 22 Nov 2000, Rik van Riel wrote:
> On Wed, 22 Nov 2000, Szabolcs Szakacsits wrote:
> > - OOM killing takes place only in do_page_fault() [no two places in
> >   the kernel for process killing]
>
> ... disable OOM killing for non-x86 architectures.
> This doesn't seem like a smart move ;)
>
> > diff -urw linux-2.2.18pre21/arch/i386/mm/Makefile linux/arch/i386/mm/Makefile
> > --- linux-2.2.18pre21/arch/i386/mm/Makefile Fri Nov 1 04:56:43 1996
          ^
As I wrote, the OOM killer changes are x86 only at present. Other archs
still use the default OOM killing defined in arch/*/mm/fault.c.

Szaka
[PATCH] Reserved root VM + OOM killer
WHY?

Permanent memory demand by user apps makes Linux uncontrollable in an
OOM (out of memory) situation, when the OOM killer can't kill as fast as
the memory is needed (and your superb 'free memory space' monitor/actor
developed over the last 5 years was also killed, and init couldn't
restart it because of OOM). In the Unix world it's common good practice
to reserve resources for root (see e.g. disk space, network ports, file
descriptors, processes, etc). Linux doesn't reserve virtual memory for
root, so if OOM is caused by user apps you get these kinds of messages
as root when trying to make the system work properly again or to
investigate what happened:

running a command from the prompt:
  Memory exhausted
  Segmentation fault
  fork failed: resource temporarily unavailable

trying to log in from the console:
  Unable to load interpreter /lib/ld-linux.so.2
  error while loading shared libraries: libc.so.6: cannot map zero-fill pages: Cannot allocate memory
  error while loading shared libraries: libtermcap.so.2: failed to map segment from shared object: Cannot allocate memory
  xrealloc: cannot reallocate 128 bytes (0 bytes allocated)
  xmalloc: cannot allocate 562 bytes (0 bytes allocated)

trying to ssh via the network:
  Received disconnect: Command terminated on signal 11.

WHAT?

This patch tries to reserve virtual memory for root, balances memory
usage between root and user apps if memory is overcommitted, and
includes Rik's OOM killer, which is much more clever about what to kill
when OOM happens than what's included in standard 2.2 kernels.

HOW?

No performance loss: RAM is always fully utilized (except if there is no
swap); the tunable reserved memory is on swap (or in caches) until it's
needed by root. There are two scenarios. When user apps don't overcommit
memory, they will see only

  UVM = (real virtual memory) - (reserved virtual memory for root)

If memory is overcommitted, then user apps will also use the reserved
memory (otherwise there would be a performance loss, I guess), but the
kernel will try hard to push them back below UVM.

IN THE PATCH:
- reserved VM for root
- Rik's OOM killer from 2.4.0-test11 with "fixes":
  - PID 1 never gets killed by the OOM killer
  - OOM killing takes place only in do_page_fault() [no two places in
    the kernel for process killing]
  - niced processes are not penalized
- IPC shared mem can only be quasi-overcommitted (i.e. a request is
  successful only if there is enough VM at request time)

NOTES:
- it's for (late) 2.2 kernels [tested with 2.2.18pre21, applies to
  2.2.18pre22 as well]
- Intel only [page fault handling is implemented differently on
  different architectures, no common hooks, but easy to fix]
- SMP not tested
- GUI environment not tested
- tests were done with constant brk, mmap, zfod, cow, and IPC shm fork
  bombs, mostly on a 64-128 MB RAM + 80 MB swap box
- using IPC shared mem can still "kill" the box (unused mem not freed).
  Use the Solar Designer kernel security patch or set
  /proc/sys/kernel/shmall according to your VM
- it's not for common fork bombs (use e.g. a fair scheduler, Fork Bomb
  Defuser, etc against them). Use ulimit -u if you want to test the
  patch and don't have enough CPU power
- the reserved virtual memory can be set at runtime via
  /proc/sys/vm/reserved. The value is in pages (4096 bytes on x86)
- on SMP you should probably increase this value as a function of your
  CPUs
- if you have GBs of VM you can experience malloc() scalability
  problems; use glibc 2.2, limit your VM, raise the limits via malloc
  environment variables, etc.

PROBLEMS:
- if a killable task is constantly in TASK_UNINTERRUPTIBLE [e.g. because
  of network fs (smb, nfs, etc) problems] then the OOM killer won't
  work ... at least this is what I suspect
- schedule() doesn't always immediately schedule the killable task
- probably others I'm not aware of

Standard disclaimer applies. It worked fine for me, but maybe it will
eat your whole computer and pets :) It's not perfect but seems good
enough, and I definitely found it much better than what is in 2.2
kernels. Of course your experience can be completely different. Please
let me know.

Szaka

diff -urw linux-2.2.18pre21/arch/i386/mm/Makefile linux/arch/i386/mm/Makefile
--- linux-2.2.18pre21/arch/i386/mm/Makefile	Fri Nov  1 04:56:43 1996
+++ linux/arch/i386/mm/Makefile	Tue Nov 21 03:03:15 2000
@@ -8,6 +8,6 @@
 # Note 2! The CFLAGS definition is now in the main makefile...
 
 O_TARGET := mm.o
-O_OBJS   := init.o fault.o ioremap.o extable.o
+O_OBJS   := init.o fault.o ioremap.o extable.o ../../../mm/oom_kill.o
 
 include $(TOPDIR)/Rules.make
diff -urw linux-2.2.18pre21/arch/i386/mm/fault.c linux/arch/i386/mm/fault.c
--- linux-2.2.18pre21/arch/i386/mm/fault.c	Wed May  3 20:16:31 2000
+++ linux/arch/i386/mm/fault.c	Tue Nov 21 05:49:36 2000
@@ -23,6 +23,7 @@
 #include <asm/hardirq.h>
 
 extern void die(const char *,struct pt_regs *,long);
+extern int oom_kill(void);
 
 /*
  * Ugly, ugly,
RE: KPATCH] Reserve VM for root (was: Re: Looking for better VM)
On Thu, 16 Nov 2000, Rik van Riel wrote:
> On Thu, 16 Nov 2000, Szabolcs Szakacsits wrote:
> [snip exploit that really shouldn't take Linux down]

I don't really consider it an exploit. It's a kind of workload optimized
for fast testing, simulating many busy user daemons (e.g. dynamically
generating web pages). Everybody knows the Slashdot effect. A system is
designed for a workload according to a budget and other factors. But as
soon as the load gets *much* higher than was ever expected, the system
starts to thrash and nobody can log in or start new processes. You can
pull out the cable, but if it's a remote box then you are really in a
bad situation. Or if a local [e.g. computing] batch job runs away, you
also must hit the reset button. This happens so often that it should
really be taken seriously now; whoever doesn't believe it should just
search for typical OOM crash patterns in user reports on different
mailing lists/newsgroups.

> > This or something similar didn't kill the box [I've tried all local
> > DoS from Packetstorm that I could find]. Please send a working
> > example. Of course probably it's possible to trigger root owned
> > processes to eat memory eagerly by user apps but that's a problem in
> > the process design running as root and not a kernel issue.
>
> Not necessarily, but your patch will probably make a difference
> for quite a number of people...

Could you please explain what you mean? ;) I saw only ONE difference:
the system remains usable for root but not for anybody else. Everything
else is the same as before. Of course I think there are still problems
with the patch, but to be honest I don't know what they are ... except
those I wrote about before -- e.g. the latest, not yet released version
definitely doesn't work with your OOM killer [the system just thrashes].
Can you explain why you put process killing in do_try_to_free_pages()
instead of the original place, do_page_fault()? I guess putting it in
do_page_fault() [if possible] would fix my current problem.

> > If you think fork() kills the box then ulimit the maximum number
> > of user processes (ulimit -u). This is a different issue and a
> > bad design in the scheduler (see e.g. Tru64 for a better one).
>
> My fair scheduler catches this one just fine. It hasn't
> been integrated in the kernel yet, but both VA Linux and
> Conectiva use it in their kernel RPM.

I know about two fair schedulers for Linux, one of them is yours, but I
couldn't try them out yet. Anyway, definitely a must ;)

> While this is not one of the sexy new kernel
> features, this will help quite a few system
> administrators and is destined to a long and
> healthy life inside kernel RPMs, maybe even
> in the main kernel tree (when 2.5 splits?).

Thanks for the feedback,
Szaka
RE: KPATCH] Reserve VM for root (was: Re: Looking for better VM)
On Wed, 1 Jan 1997 [EMAIL PROTECTED] wrote:
> > main() { while(1) if (fork()) malloc(1); }
> >
> > With the patch below I could ssh to the host and killall the
> > offending processes. To enable reserving VM space for root do
>
> what about main() { while(1) system("ftp localhost &"); }
> This or something similar should allow you to kill your machine
> even with your patch from a normal user account

This or something similar didn't kill the box [I've tried all local DoS
from Packetstorm that I could find]. Please send a working example. Of
course it's probably possible for user apps to trigger root-owned
processes into eating memory eagerly, but that's a problem in the design
of the process running as root, not a kernel issue.

Note, I'm not discussing "a local user can kill the box without limits";
I'm saying Linux "deadlocks" [it starts its own autonomous life and
usually your only chance is to hit the reset button] when there is
continuous VM pressure from user applications. If you think fork() kills
the box, then ulimit the maximum number of user processes (ulimit -u).
That is a different issue and a bad design in the scheduler (see e.g.
Tru64 for a better one).

BTW, I have a new version of the patch with which Linux behaves much
better from root's point of view when memory is more significantly
overcommitted. I'll post it if I have time [and there is interest].

Szaka