Re: BTRFS partition usage...

2008-02-13 Thread Szabolcs Szakacsits

On Tue, 12 Feb 2008, Jeff Garzik wrote:
> 
> Yep.  I chose 32K unused space in the prototype filesystem I wrote [1, 2.4
> era].  I'm pretty sure I got that number from some other filesystem, maybe
> even some NTFS incarnation.  

The NTFS superblock (and its partial mirror copy) can be anywhere except in 
the first blocks. That space is where the $Boot file is placed, which 
contains the bootstrap code and the BIOS Parameter Block; the latter 
includes the NTFS signature and describes various filesystem parameters 
needed to locate the superblock, etc.

Unlike mkfs.xfs, which has warned since at least 2002 and requires the -f 
option to override Sun disklabels, at the moment mkfs.ntfs will indeed 
destroy them. 

Thank you for the bug report, and let's hope the next generation of Sun 
hardware won't also scatter the firmware into random places inside a 
partition, encoded by a fictitious disk cylinder size.

Szaka
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] util-linux-ng: unprivileged mounts support

2008-01-19 Thread Szabolcs Szakacsits

On Sat, 19 Jan 2008, Miklos Szeredi wrote:
> 
> But 'fusermount -u /tmp/test' does work, doesn't it?

You're submitting patches to get rid of fusermount, aren't you?

Most users have absolutely no idea what fusermount is, and they would 
__really__ like to see umount(8) finally working. 

Szaka

--
NTFS-3G:  http://ntfs-3g.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] util-linux-ng: unprivileged mounts support

2008-01-19 Thread Szabolcs Szakacsits

On Wed, 16 Jan 2008, Miklos Szeredi wrote:

> This is an experimental patch for supporting unprivileged mounts and
> umounts.  

User unmount unfortunately still doesn't work if the kernel doesn't have 
unprivileged mount support, but as we discussed last July, that shouldn't 
be needed for this case.

  % mount -t ntfs-3g /dev/hda10 /tmp/test
  % cat /proc/mounts | grep /tmp/test
  /dev/hda10 /tmp/test fuseblk rw,nosuid,nodev,user_id=501,group_id=501,allow_other 0 0
  % mount | grep /tmp/test
  /dev/hda10 on /tmp/test type fuseblk (rw,nosuid,nodev,allow_other,blksize=1024,user=szaka)
  % umount /tmp/test
  umount: /dev/hda10: not mounted
  umount: /tmp/test: must be superuser to umount
  umount: /dev/hda10: not mounted
  umount: /tmp/test: must be superuser to umount

Szaka

--
NTFS-3G:  http://ntfs-3g.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE] util-linux-ng 2.13.1 (stable)

2008-01-19 Thread Szabolcs Szakacsits

On Wed, 16 Jan 2008, Karel Zak wrote:

> mount:
>- doesn't drop privileges properly when calling helpers  [Ludwig Nussel]

How can a mount helper know, without being setuid root and redundantly 
doing mount(8)'s work, that the user is allowed to mount via the 'user[s]' 
fstab mount option? 

Szaka

--
NTFS-3G:  http://ntfs-3g.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Szabolcs Szakacsits

On Tue, 15 Jan 2008, Daniel Phillips wrote:

> Along with this effort, could you let me know if the world actually
> cares about online fsck?  Now we know how to do it I think, but is it
> worth the effort.

Most users seem to care deeply about "things just working". Here is why 
ntfs-3g also took the online fsck path some time ago.

NTFS support had a very bad reputation on Linux, so the new code was 
written with rigid sanity checks and extensive automated regression 
testing. One consequence is that we are detecting far too many 
inconsistencies left behind by the Windows and other NTFS drivers, 
hardware faults, and device drivers.

To better utilize the non-existent developer resources, the obvious choice 
was to suggest the already existing Windows fsck (chkdsk) in such cases. 
Simple and safe, as most people who, like us, never used Windows would 
think. 

However, years of experience show that, depending on several factors, 
chkdsk may or may not start, may or may not report the real problems, may 
report bogus issues, may run for a long time or just forever, and may even 
remove completely valid files. So one could perhaps even consider a 
suggestion to run chkdsk a call to play Russian roulette.

Thankfully NTFS has some level of metadata redundancy, with signatures and 
weak "checksums", which makes it possible to correct some common and 
obvious corruptions on the fly.

Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:
http://technet2.microsoft.com/windowsserver2008/en/library/6f883d0d-3668-4e15-b7ad-4df0f6e6805d1033.mspx?mfr=true

Szaka

--
NTFS-3G:  http://ntfs-3g.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Szabolcs Szakacsits

Hi,

On Wed, 9 Jan 2008, Nigel Cunningham wrote:
> On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote:
> >
> > For the suspend issue, there are also no easy solutions.
> 
> What are the non-easy solutions?

From a practical point of view, I've seen only fuse rootfs mounts be a 
problem. I remember Ubuntu patches for this (Wubi and some other distros 
install an NTFS root). But this probably also depends on the suspend 
implementation used.

Personally, I've never had a fuse-related suspend problem with ordinary 
mounts during heavy use under development, nor has any NTFS user problem 
been tracked down to one in the last one and a half years.

Regards,
Szaka

-- 
NTFS-3G:  http://ntfs-3g.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts

2008-01-08 Thread Szabolcs Szakacsits

On Tue, 8 Jan 2008, Miklos Szeredi wrote:
> > On Tue, 2008-01-08 at 12:35 +0100, Miklos Szeredi wrote:
> > > +static int reserve_user_mount(void)
> > > +{
> > > +   int err = 0;
> > > +
> > > +   spin_lock(&vfsmount_lock);
> > > +   if (nr_user_mounts >= max_user_mounts && !capable(CAP_SYS_ADMIN))
> > > +   err = -EPERM;
> > > +   else
> > > +   nr_user_mounts++;
> > > +   spin_unlock(&vfsmount_lock);
> > > +   return err;
> > > +} 
> > 
> > Would -ENOSPC or -ENOMEM be a more descriptive error here?  
> 
> The logic behind EPERM, is that this failure is only for unprivileged
> callers.  ENOMEM is too specifically about OOM.  It could be changed
> to ENOSPC, ENFILE, EMFILE, or it could remain EPERM.  What do others
> think?

I think it would be important to log the non-trivial errors. mount(8) 
already hints in several cases to check dmesg for the reason, since it's 
too challenging to figure out what exactly the problem is from the errno 
value alone. This could also prevent misleading troubleshooters with the 
mount/sysctl race.

Szaka

-- 
NTFS-3G:  http://ntfs-3g.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-27 Thread Szabolcs Szakacsits

On Sat, 27 Oct 2007, Anton Altaparmakov wrote:

> And another of my pet peeves with ->bmap is that it uses 0 to mean "sparse"
> which causes a conflict on NTFS at least as block zero is part of the $Boot
> system file so it is a real, valid block...  NTFS uses -1 to denote sparse
> blocks internally.

In practice, the meaning of 0 is file system [driver] dependent. For 
example, in the case of NTFS-3G it means that the block is sparse, or the 
file is encrypted, compressed, or resident, or it's the $Boot file, or an 
error happened.

Thankfully the widely used FIBMAP users (swapon, and the ever less used 
lilo) are only interested in the non-zero values, and they report an error 
if the driver returns 0 for some reason. That is perfectly fine, since 
both swapping and Linux booting would fail on a sparse, encrypted, 
compressed, or resident file, or on the NTFS $Boot file. 

But in reality, both swap files and lilo work fine with NTFS if the needed 
files were created the way this software expects. If not, then swapon or 
lilo will catch and report the file creation error.

AFAIR, somebody is working on (has finished?) an indeed much needed better 
alternative. Bmap is legacy; thank you, Mike, for maintaining it.

Szaka

--
NTFS-3G Lead Developer:  http://ntfs-3g.org
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: curedump configuration additions

2001-05-06 Thread Szabolcs Szakacsits


On Sat, 5 May 2001, Michael Miller wrote:

> +coredump_enabled:
> +When enabled (which is the default), Linux will produce
[...]
> +coredump_log:
> +The default is to log coredumps.

The default looks like an effective way to DoS logging and fill the system
partition fast.

Another nice optional feature, from a development, debugging, and QA point
of view, would be the ability to also dump set[ug]id apps, or apps that
changed their uid or gid, a la
kern.sugid_coredump (FreeBSD)
kern.nosuidcoredump (OpenBSD)
allow_setid_core (Solaris)
etc.

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] Thread core dumps for 2.4.4

2001-05-06 Thread Szabolcs Szakacsits


On Thu, 3 May 2001, Don Dugger wrote:

> The attached patch allows core dumps from thread processes in the 2.4.4
> kernel.  This patch is the same as the last one I sent out except it fixes
> the same bug that `kernel/fork.c' had with duplicate info in the `mm'
> structure, plus this patch has had more extensive testing.

AFAIK Linux can't dump the threads to the same file as others do, but
doing it to different files looks a bit scary. How does the system behave
when you dump a heavily threaded app with a decent VM size [i.e., just
think of bloatware instead of malicious code]? How will the developer know
which thread caused the fault? I've found that dumping just the faulting
thread is enough in about 100% of cases, especially because [on SMP] the
others can run on, and the dump is then much closer to "garbage" than to
useful info from a debugging point of view.

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: __alloc_pages: 4-order allocation failed

2001-04-26 Thread Szabolcs Szakacsits


On Thu, 26 Apr 2001, Jeff V. Merkey wrote:

> I am seeing this as well on 2.4.3 with both _get_free_pages() and
> kmalloc().  In the kmalloc case, the modules hang waiting
> for memory.

One possible source of this hang is the change below in 2.4.3:
non-GFP_ATOMIC, non-recursive allocations (i.e., those without PF_MEMALLOC
set) will loop until the requested contiguous memory is available.

Szaka

diff -u --recursive --new-file v2.4.2/linux/mm/page_alloc.c linux/mm/page_alloc.c
--- v2.4.2/linux/mm/page_alloc.c	Sat Feb  3 19:51:32 2001
+++ linux/mm/page_alloc.c	Tue Mar 20 15:05:46 2001
@@ -455,8 +455,7 @@
 			memory_pressure++;
 			try_to_free_pages(gfp_mask);
 			wakeup_bdflush(0);
-			if (!order)
-				goto try_again;
+			goto try_again;
 		}
 	}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: OOM killer *WORKS* for a change!

2001-04-17 Thread Szabolcs Szakacsits


On Fri, 13 Apr 2001, Mike A. Harris wrote:

> I just ran netscape which for some reason or another went totally
> whacky and gobbled RAM.  It has done this before and made the box
> totally unuseable in 2.2.17-2.2.19 befor the kernel killed 90% of
> my running apps before getting the right one.

I ported the 2.4 OOM killer to 2.2 about half a year ago; it is available
for the 2.2.19 kernel at
http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_memory.html

Note that since it is activated in the page fault handler, which is
architecture-dependent, the current patch works only on x86 (the only one
I could test). If anyone is interested in other archs, let me know.

   Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scheduler went mad?

2001-04-12 Thread Szabolcs Szakacsits


On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > You mean without dropping out_of_memory() test in kswapd and calling
> > oom_kill() in page fault [i.e. without additional patch]?
> No.  I think it's ok for __alloc_pages() to call oom_kill()
> IF we turn out to be out of memory, but that should not even
> be needed.

It is not __alloc_pages() that calls oom_kill(), but do_page_fault(). Not
the same thing. After the system has tried *really* hard to get *one* free
page and couldn't manage it, why loop forever? To eat CPU while waiting
for out_of_memory() to *guess* when the system is OOM? I don't think so;
if processes can't progress because the system can't page in any of their
pages, somebody must go.

> Also, when a task in __alloc_pages() is OOM-killed, it will
> have PF_MEMALLOC set and will immediately break out of the
> loop. The rest of the system will spin around in the loop
> until the victim has exited and then their allocations will
> succeed.

Yes, I think this is a problem. On a page fault under OOM, the "bad"
process is selected, scheduled, and killed, and everybody runs on happily
without even noticing that the system is low on memory. Fast and gracious
process killing instead of a slow, painful death, IF out_of_memory()
correctly detects OOM.

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scheduler went mad?

2001-04-12 Thread Szabolcs Szakacsits


On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > I still feel a bit unconfortable about processes looping forever in
> > __alloc_pages and because of this oom_killer also can't be moved to
> > page fault handler where I think its place should be. I'm using the
> > patch below.
> It's BROKEN.  This means that if you have one task using up
> all memory and you're waiting for the OOM kill of that task
> to have effect, your syslogd, etc... will have their allocations
> fail and will die.

You mean without dropping the out_of_memory() test in kswapd and calling
oom_kill() in the page fault handler [i.e., without the additional patch]?
Yes, you're completely right, but I have the patch [see example below;
'm1' is the bad guy]. I just didn't have time to test it extensively, and
I don't know whether there are side effects of getting rid of this
infinite looping in __alloc_pages(), but locked-up processes apparently
don't make people very happy ;)

Szaka

Out of Memory: Killed process 830 (m1), saved process 696 (httpd)
   procs  memoryswap  io system
 r  b  w   swpd   free   buff  cache  si  sobibo   incs
 6  0  0  0   9492100   1496   0   0  1386 2 2904  3877
 5  0  0  0   7812104   1788   0   0   289 0  68922
 5  0  0  0   6248104   1788   0   0 0 0  10819
 5  0  0  0   4748108   1840   0   056 0  21921
 5  0  0  0   3268108   1868   0   028 0  16523
 5  0  1  0   1864 76   1868   0   0 0 5  12061
 5  0  1  0   1432 76   1252   0   0 0 0  108  1130
 5  0  1  0   1236 80796   0   065 0  246  4588
 5  0  1  0   1236 80668   0   0 0 0  110  8869
 6  0  1  0948112696   0   0   805 0 1814  8231
Out of Memory: Killed process 858 (m1), saved process 811 (vmstat)
 5  0  1  0924152444   0   0  1153 0 2731 18231
 4  0  1  0   1720148828   0   0   750 3 1711  1876
 5  0  1  0   1156148760   0   0   290 0  723  1967
 4  0  1  0   1152132664   0   070 0  277  7249
 4  0  1  0   1140144560   0   054 0  238  7942
 4  0  1  0   1140144460   0   032 0  212  7521
Out of Memory: Killed process 834 (m1), saved process 418 (identd)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scheduler went mad?

2001-04-12 Thread Szabolcs Szakacsits


On Thu, 12 Apr 2001, Marcelo Tosatti wrote:

> This patch is broken, ignore it.
> Just removing wakeup_bdflush() is indeed correct.
> We already wakeup bdflush at try_to_free_buffers() anyway.

I still feel a bit uncomfortable about processes looping forever in
__alloc_pages(), and because of this the oom_killer also can't be moved to
the page fault handler, where I think its place should be. I'm using the
patch below.

Szaka

--- mm/page_alloc.c.orig	Sat Mar 31 19:07:22 2001
+++ mm/page_alloc.c	Mon Apr  2 21:05:31 2001
@@ -453,8 +453,12 @@
 		 */
 		if (gfp_mask & __GFP_WAIT) {
 			memory_pressure++;
-			try_to_free_pages(gfp_mask);
-			wakeup_bdflush(0);
+			if (!try_to_free_pages(gfp_mask))
+				return NULL;
 			goto try_again;
 		}
 	}


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: scheduler went mad?

2001-04-12 Thread Szabolcs Szakacsits


On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > I still feel a bit uncomfortable about processes looping forever in
> > __alloc_pages and because of this oom_killer also can't be moved to
> > page fault handler where I think its place should be. I'm using the
> > patch below.
> It's BROKEN.  This means that if you have one task using up
> all memory and you're waiting for the OOM kill of that task
> to have effect, your syslogd, etc... will have their allocations
> fail and will die.

You mean without dropping the out_of_memory() test in kswapd and calling
oom_kill() in the page fault handler [i.e. without the additional
patch]? Yes, you're completely right, but I have that patch [see the
example below; 'm1' is the bad guy]; I just haven't had time to test it
extensively, and I don't know whether there are side effects to getting
rid of this infinite looping in __alloc_pages(), but locked-up processes
apparently don't make people very happy ;)

Szaka

Out of Memory: Killed process 830 (m1), saved process 696 (httpd)
   procs  memoryswap  io system
 r  b  w   swpd   free   buff  cache  si  sobibo   incs
 6  0  0  0   9492100   1496   0   0  1386 2 2904  3877
 5  0  0  0   7812104   1788   0   0   289 0  68922
 5  0  0  0   6248104   1788   0   0 0 0  10819
 5  0  0  0   4748108   1840   0   056 0  21921
 5  0  0  0   3268108   1868   0   028 0  16523
 5  0  1  0   1864 76   1868   0   0 0 5  12061
 5  0  1  0   1432 76   1252   0   0 0 0  108  1130
 5  0  1  0   1236 80796   0   065 0  246  4588
 5  0  1  0   1236 80668   0   0 0 0  110  8869
 6  0  1  0948112696   0   0   805 0 1814  8231
Out of Memory: Killed process 858 (m1), saved process 811 (vmstat)
 5  0  1  0924152444   0   0  1153 0 2731 18231
 4  0  1  0   1720148828   0   0   750 3 1711  1876
 5  0  1  0   1156148760   0   0   290 0  723  1967
 4  0  1  0   1152132664   0   070 0  277  7249
 4  0  1  0   1140144560   0   054 0  238  7942
 4  0  1  0   1140144460   0   032 0  212  7521
Out of Memory: Killed process 834 (m1), saved process 418 (identd)




Re: scheduler went mad?

2001-04-12 Thread Szabolcs Szakacsits


On Thu, 12 Apr 2001, Rik van Riel wrote:
> On Thu, 12 Apr 2001, Szabolcs Szakacsits wrote:
> > You mean without dropping out_of_memory() test in kswapd and calling
> > oom_kill() in page fault [i.e. without additional patch]?
> No.  I think it's ok for __alloc_pages() to call oom_kill()
> IF we turn out to be out of memory, but that should not even
> be needed.

It's not __alloc_pages() that calls oom_kill(), it's do_page_fault().
Not the same. After the system has tried *really* hard to get *one*
free page and couldn't manage it, why loop forever? To eat CPU while
waiting for out_of_memory() to *guess* when the system is OOM? I don't
think so. If processes can't make progress because the system can't
page in any of their pages, somebody must go.

> Also, when a task in __alloc_pages() is OOM-killed, it will
> have PF_MEMALLOC set and will immediately break out of the
> loop. The rest of the system will spin around in the loop
> until the victim has exited and then their allocations will
> succeed.

Yes, I think this is a problem. On OOM in the page fault handler, the
"bad" process is selected, scheduled and killed, and everybody else
runs happily without even noticing the system is low on memory. Fast
and gracious process killing instead of a slow, painful death, IF
out_of_memory() correctly detects OOM.

Szaka




Re: pcnet32 (maybe more) hosed in 2.4.3

2001-03-31 Thread Szabolcs Szakacsits


On Fri, 30 Mar 2001, Scott G. Miller wrote:

> Linux 2.4.3, Debian Woody.  2.4.2 works without problems.  However, in
> 2.4.3, pcnet32 loads, gives an error message:

2.4.3 (and the -ac's) is also broken as a guest in VMware due to the
pcnet32 changes [doing 32-bit I/O on 16-bit registers on the 79C970A
controller]. Reverting the part of patch-2.4.3 below made things work
again.

Szaka

@@ -528,11 +535,13 @@
 pcnet32_dwio_reset(ioaddr);
 pcnet32_wio_reset(ioaddr);

-if (pcnet32_wio_read_csr (ioaddr, 0) == 4 && pcnet32_wio_check (ioaddr)) {
-	a = &pcnet32_wio;
+/* Important to do the check for dwio mode first. */
+if (pcnet32_dwio_read_csr(ioaddr, 0) == 4 && pcnet32_dwio_check(ioaddr)) {
+	a = &pcnet32_dwio;
 } else {
-	if (pcnet32_dwio_read_csr (ioaddr, 0) == 4 && pcnet32_dwio_check(ioaddr)) {
-		a = &pcnet32_dwio;
+	if (pcnet32_wio_read_csr(ioaddr, 0) == 4 &&
+	    pcnet32_wio_check(ioaddr)) {
+		a = &pcnet32_wio;
	} else
		return -ENODEV;
 }





Re: OOM killer???

2001-03-29 Thread Szabolcs Szakacsits


On Thu, 29 Mar 2001, Dr. Michael Weller wrote:

> Applications forking and then dirtying their shared data pages
> madly? OOps.. nothing.. Why? It cannot be done!

In eager mode, Solaris, Tru64, Irix, and the non-overcommit patch for
Linux by Eduardo Horvath from last year can do it (you get ENOMEM at fork).
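With eager reservation the failure is reported synchronously, so portable code can handle it at fork() time; a small sketch (plain POSIX; the function name and error-handling policy are made up):

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* With eager (non-overcommit) accounting, a fork that cannot reserve
 * backing store for the child's writable pages fails up front with
 * ENOMEM, instead of the child being killed later on first write. */
int spawn_or_report(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        if (errno == ENOMEM)
            fprintf(stderr, "fork: address space not reservable\n");
        return -1;                 /* caller can shed load and retry */
    }
    if (pid == 0)
        _exit(0);                  /* child: would exec() here */
    int status;
    waitpid(pid, &status, 0);
    return 0;
}
```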

Szaka




Re: OOM killer???

2001-03-29 Thread Szabolcs Szakacsits


On Thu, 29 Mar 2001, Dr. Michael Weller wrote:
> On Thu, 29 Mar 2001, Szabolcs Szakacsits wrote:
> > The point is AIX *can* guarantee [even for an ordinary process] that
> > your signal handler will be executed, Linux can *not*. It doesn't matter
> No it can't... and the reason is...

So AIX is buggy in eager mode, not reserving a couple of extra pages
[per process] to be able to run the handler. What AIX version(s) do you
use? Anyway, as you probably noticed, at present I'm not a big supporter
of introducing SIGDANGER; too many things can be messed up for little
or no gain.

> Note that there are nasty users like me, which provide a no_op function
> as SIGDANGER handler.

For example this.

> Joe blow user can code a SIGDANGER exploiting prog that will kill the
> whole concept by allocating memory in SIGDANGER.

And this. Moreover, it needn't even be malicious; people happily write
signal handlers that would blow things up without them realising it ...

And admins still have no control over any of this ;) Sure, these could
be worked around, but I feel it just isn't worth the added complexity.

> About these early allocation myths: Did you actually read the page?
> The fact its controlled by a silly environment variable shows it
> is a mere user space issue.

This is my question as well ;) I didn't read the AIX source, but I
guessed the kernel sets a bit in the task structure for eager mode
during the exec() syscall and takes care of everything; at least this is
what the document suggests ;) [see the bottom of the page]

Szaka




Re: OOM killer???

2001-03-29 Thread Szabolcs Szakacsits


On Thu, 29 Mar 2001, Dr. Michael Weller wrote:
> On Wed, 28 Mar 2001, Andreas Dilger wrote:
> > Szaka writes:
> > > And every time the SIGDANGER comes up, the issue that AIX provides
> > > *both* early and late allocation mechanism even on per-process basis
> > > that can be controlled by *both* the programmer and the admin is
> > > completely ignored. Linux supports none of these...
> Maybe some details here were helpful.

http://www.unet.univie.ac.at/aix/aixbman/baseadmn/pag_space_under.htm

> > > ...with the current model it's quite possible the handler code is still
> > > sitting on the disk and no memory will be available to page it in.
> > Actually, I see SIGDANGER being useful at the time we hit freepages.min

The point is AIX *can* guarantee [even for an ordinary process] that
your signal handler will be executed; Linux can *not*. It doesn't matter
where the different OOM watermarks are, there will always be situations
where, by the time your handler gets control, it's already far too late
[because between sending SIGDANGER and the app getting control (you
can't schedule e.g. 1000 apps at the same time) the system runs into OOM
and kills just your app (while e.g. the other 999 buggy mem-leaking apps
registered a no-op SIGDANGER handler); hope you get the picture even if
the example is highly unrealistic].

> > (or maybe even freepages.low), rather than actual OOM so that the apps
> > have a chance to DO something about the memory shortage,

Primarily *users* should have a chance to control this thing, not
developers and the kernel. The latter should provide a way to control
things and have a reasonable default [Linux already has the latter but
not the former]. Guess which one you want killed if you run Oracle in
production and DB2, Informix, Sybase, etc. in trial. Now only the kernel
decides. With SIGDANGER, still only developers/kernel would decide [and
forget about resource management that would prevent running all of them
on the same box; Linux users want to fully utilize the box ;)].

So again, IMHO to address this long standing problem, Linux needs
- optional non-overcommit; per-process granularity would be nice
  [and leave the default just as it is now]; the oom killer could
  additionally weight based on this info
- reserved/guaranteed superuser memory [otherwise, on user space OOM
  (= system OOM) in non-overcommit mode, the oom killer would just take
  action]
- and, as a last chance not to deadlock, an advisory oom killer with a
  reasonable default [the current default is pretty fine, apart from
  its current bugs]
- a HOWTO about preventing OOM, killing your important processes, etc.
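The first two points combine naturally: strict accounting that refuses commitments up front, with the last pages held back for root. A toy sketch of that check (the numbers and the function name are invented, not the actual reserved-root-memory patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of strict (non-overcommit) accounting plus a superuser
 * reserve: a commitment succeeds only if it fits in total VM, and the
 * last root_reserve pages are handed out to root only.  Illustrative
 * names and numbers, not real kernel code. */

static long total_vm_pages = 1000;   /* RAM + swap, in pages */
static long committed_pages;
static long root_reserve = 50;       /* pages only root may commit */

/* Returns 0 if the commitment fits, -1 for what would be ENOMEM. */
int vm_enough_memory_strict(long pages, bool is_root)
{
    long limit = total_vm_pages - (is_root ? 0 : root_reserve);
    if (committed_pages + pages > limit)
        return -1;                   /* refuse at mmap()/fork() time */
    committed_pages += pages;
    return 0;
}
```

With this, an ordinary user hits ENOMEM while root can still log in and take action, which is exactly the complaint about the unreserved case.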

Later on, virtual swap space [the degree of memory overcommitment] and
SIGDANGER might be useful, but I don't think so at present.

BTW, the issue is far more difficult than some "let's fix malloc and
its friends" people think.

> If freepages.min is reached, AIX starts to kill processes (just like OOM
> killer). It uses some heuristics which might be better than our, but I
> doubt it.

If every process runs in non-overcommit mode AIX kills init first :)

Szaka




Re: OOM killer???

2001-03-28 Thread Szabolcs Szakacsits


On Tue, 27 Mar 2001, Rogier Wolff wrote:

> Out of Memory: Killed process 117 (sendmail).
[ ... many of these ... ]
> Out of Memory: Killed process 117 (sendmail).
>
> What we did to run it out of memory, I don't know. But I do know that
> it shouldn't be killing one process more than once... (the process
> should not exist after one try...)

I already noted this last week. Processes in TASK_UNINTERRUPTIBLE state
can't be scheduled, so they won't be killed immediately. This state can
also be permanent if the process is using [buggy?] smbfs, NFS without
the 'hard,intr' option, buggy drivers or hardware. What's worse, if this
state is permanent, a lockup is guaranteed [the other, random OOM killer
in the page fault handler never gets the chance to run, for some
mysterious reason (it worked fine in 2.2)]. The solution is easy: one
bit in the task structure should indicate that the process was already
SIGKILL'ed ... oops, it must already be there, so it just needs to be
taken into account by the OOM killer. Hopefully it won't result in a
massacre ... [that would still be better than a lockup, wouldn't it ;)]
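The suggested fix amounts to skipping already-shot tasks during victim selection; a self-contained sketch (the field and function names are illustrative, not the real 2.4 structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch: when choosing an OOM victim, skip tasks that already have
 * SIGKILL pending (e.g. stuck unkillable in TASK_UNINTERRUPTIBLE), so
 * the killer moves on instead of "killing" the same process over and
 * over.  Not the real 2.4 task_struct or badness() code. */

struct task {
    int pid;
    long badness;            /* the OOM heuristic's score */
    bool sigkill_pending;    /* already shot once */
};

struct task *select_oom_victim(struct task *tasks, size_t n)
{
    struct task *victim = NULL;
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].sigkill_pending)
            continue;        /* it will die once it can be scheduled */
        if (victim == NULL || tasks[i].badness > victim->badness)
            victim = &tasks[i];
    }
    return victim;           /* NULL: everyone is already dying */
}
```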

Szaka




Re: OOM killer???

2001-03-28 Thread Szabolcs Szakacsits


On Tue, 27 Mar 2001, Andreas Dilger wrote:

> Every time this subject comes up, I point to AIX and SIGDANGER - a signal
> sent to processes when the system gets OOM.

And every time SIGDANGER comes up, the issue that AIX provides *both*
early and late allocation mechanisms, even on a per-process basis,
controllable by *both* the programmer and the admin, is completely
ignored. Linux supports none of these, and with the current model it's
quite possible the handler code is still sitting on disk with no memory
available to page it in. Or do you want to see more apps running
setuid-root and mlocking the handler, wasting useful memory and opening
an even bigger window for security exploits in the future? And even
using capabilities instead of setuid-root, only developers could
influence the behavior, not the admins who must operate the box. No, at
present the SIGDANGER bloat would just be a fake excuse and wouldn't
address the root of the problems at all.

Szaka




Re: [PATCH] Prevent OOM from killing init

2001-03-25 Thread Szabolcs Szakacsits


On Sat, 24 Mar 2001, Jesse Pollard wrote:
> On Fri, 23 Mar 2001, Alan Cox wrote:
[  about non-overcommit  ]
> > Nobody feels its very important because nobody has implemented it.

Enterprises use other systems because those have much better resource
management than Linux; adding non-overcommit wouldn't help them much.
Desktop users and Linux newbies don't understand
eager/early/non-overcommit vs. lazy/late/overcommit memory management
[just see these threads if you aren't bored enough already ;)], and even
if they eventually do, they don't have the ability to implement it. In
between, people are mostly fine with ulimit.

> Small correction - It was implemented, just not included in the standard
> kernel.

Please note, adding optional non-overcommit also wouldn't help much
without guaranteed/reserved resources [e.g. you are OOM -> apps and
users complain, the admin logs in and BANG, the OOM killer just killed
one of the jobs]. This was one of the reasons I made the reserved root
memory patch [this is also the way other OSes do it]. Now the different
patches just need to be merged, and an OOM FAQ written for users on how
to avoid it, control it, etc.

Szaka




Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread Szabolcs Szakacsits


On Fri, 23 Mar 2001, Alan Cox wrote:
> > > and rely on it. You might find you need a few Gbytes of swap just to
> > > boot
> > Seems a bit exaggeration ;) Here are numbers,
> NetBSD is if I remember rightly still using a.out library styles.

No, it uses ELF today, moreover the numbers were from Solaris. NetBSD
also switched from non-overcommit to overcommit-only [AFAIK] mode with
"random" process killing with its new UVM.

> > 6-50% more VM and the performance hit also isn't so bad as it's thought
> > (Eduardo Horvath sent a non-overcommit patch for Linux about one year
> > ago).
> The Linux performance hit would be so close to zero you shouldn't be able to
> measure it - or it was in 1.2 anyway

Yep, something like this :)

Szaka




Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread Szabolcs Szakacsits


On Fri, 23 Mar 2001, Paul Jakma wrote:
> On Fri, 23 Mar 2001, Szabolcs Szakacsits wrote:
> > About the "use resource limits!". Yes, this is one solution. The
> > *expensive* solution (admin time, worse resource utilization, etc).

Thanks for cutting out the relevant parts that said how to increase user
base and satisfaction while keeping and using the existing possibility
as well.

> traditional user limits have worse resource utilisation? think what
> kind of utilisation a guaranteed allocation system would have. instead
> of 128MB, you'd need maybe a GB of RAM and many many GB of swap for
> most systems.

Nonsense hodgepodge. See and/or measure the impact; I sent numbers in my
former email. You also missed that non-overcommit must be _optional_
[i.e. you wouldn't be forced to use it ;)]. Yes, there are users and
enterprises who require it and would happily pay the 50-100% extra swap
space for the same workload and the extra reliability.

> - setting up limits on a RH system takes 1 minute by editing
> /etc/security/limits.conf.

Every time you add/delete users, add/delete special apps, etc.
Please note again: some people want it this way, some only sometimes,
and others really don't care, because the system guarantees the admins
will always have the resources to take action [unfortunately, that
system is not Linux].

> - Rik's current oom killer may not do a good job now, but it's
> impossible for it to do a /perfect/ job without implementing
> kernel/esp.c.

Rik's killer is quite fine as the _default_. But there will always be
people who won't like it [the bastards think humans can still make
better decisions than machines]. Wouldn't it be a win for both sides if
you could point out, "Hey, if you don't like the default, use the
/proc/sys/vm/oom_killer interface"? As I said before, there is also such
a patch, by Chris Swiedler, and definitely not a huge, complex one.
And these stupid threads could be forgotten for good.

> - with limits set you will have:
>  - /possible/ underutilisation on some workloads.

It depends; guaranteed underutilisation or guaranteed extra
unreliability often fit the picture just as well.

> no matter how good or bad Rik's killer is, i'd much rather set limits
> and just about /never/ have it invoked.

Thanks for expressing your opinion, but others [not necessarily me]
"occasionally" have a different one, depending on the job the box must do.

Szaka





Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread Szabolcs Szakacsits


On Thu, 22 Mar 2001, Alan Cox wrote:

> I'd like to have it there as an option. As to the default - You
> would have to see how much applications assume they can overcommit
> and rely on it. You might find you need a few Gbytes of swap just to
> boot

Seems a bit of an exaggeration ;) Here are numbers,

http://lists.openresources.com/NetBSD/tech-userlevel/msg00722.html

6-50% more VM, and the performance hit also isn't as bad as is thought
(Eduardo Horvath sent a non-overcommit patch for Linux about one year
ago).

Szaka




Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread Szabolcs Szakacsits


On Thu, 22 Mar 2001, Guest section DW wrote:
> Presently however, a flawless program can be killed.
> That is what makes Linux unreliable.

What you advocate is "save the application, crash the OS!". You can't be
blamed; this is everybody's first reaction :) But if you start to think
about it, you reach the conclusion that process killing can't be avoided
if you want the system to keep running. I agree, though, that Linux
lacks some important things [see my other email] that could make the
situation easily and inexpensively controllable.

BTW, your app isn't flawless, because it doesn't consider that Linux
memory management is [quasi-]overcommit-only at present ;) [or you used
other apps as well; e.g. login, ps or cron is enough to kill your app
when it stalled at OOM time].

Szaka




Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread Szabolcs Szakacsits


On Thu, 22 Mar 2001, Alan Cox wrote:

> One of the things that we badly need to resurrect for 2.5 is the
> beancounter work which would let you reasonably do things like
> guaranteed Oracle a certain amount of the machine, or restrict all
> the untrusted users to a total of 200Mb hard limit between them etc

This would improve Linux reliability, but it could be much better still
with the addition of *optional* non-overcommit (most other OSes support
this, and it is usually even the default [and please, no "but it
deadlocks" -- it's not true, they also kill processes (Solaris, etc.)]),
reserved superuser memory (a la Solaris, Tru64, etc.: on OOM in
non-overcommit mode, users complain and the superuser acts, instead of
the OS killing their tasks), and a superuser-*advisory* OOM killer
[there was a patch for this before]; in that last area I think Linux is
already ahead of the others.

About "use resource limits!": yes, that is one solution. The
*expensive* solution (admin time, worse resource utilization, etc.).
Other systems make it cheaper by mixing it with the approaches above.

Szaka




Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread Szabolcs Szakacsits


On Wed, 21 Mar 2001, Rik van Riel wrote:
> One question ... has the OOM killer ever selected init on
> anybody's system ?

Hi Rik,

When I ported your OOM killer to 2.2.x and integrated it into the
'reserved root memory' [*] patch, during intensive testing I found two
cases where init was killed. It happened on low-end machines, when the
OOM killer wasn't triggered, so init was killed in the page fault
handler. The latter was also one of the reasons I replaced the "random"
OOM killer in the page fault handler with yours [so there is only one
OOM killer]. I also asked you at the time whether there was any reason
you didn't put it there as well, but unfortunately you didn't answer.
Practice showed it works there too [and actually some crashes that were
reported here recently could have been avoided this way], but maybe I
missed something technically?

Other things that bothered me,
 - niced processes are penalized
 - trying to kill a task that is permanently in TASK_UNINTERRUPTIBLE
   will probably deadlock the machine [or the random OOM killer will
   kill the box].

Szaka

[*] who are interested, it can be found at
http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_memory.html




Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread Szabolcs Szakacsits


On Fri, 23 Mar 2001, Alan Cox wrote:
   and rely on it. You might find you need a few Gbytes of swap just to
   boot
  Seems a bit exaggeration ;) Here are numbers,
 NetBSD is if I remember rightly still using a.out library styles.

No, it uses ELF today; moreover, the numbers were from Solaris. NetBSD
also switched [AFAIK] from non-overcommit to overcommit-only mode, with
"random" process killing, with its new UVM.

  6-50% more VM and the performance hit also isn't so bad as it's thought
  (Eduardo Horvath sent a non-overcommit patch for Linux about one year
  ago).
 The Linux performance hit would be so close to zero you shouldnt be able to
 measure it - or it was in 1.2 anyway

Yep, something like this :)

Szaka




Re: system call for process information?

2001-03-14 Thread Szabolcs Szakacsits


On Wed, 14 Mar 2001, Alexander Viro wrote:
> On Wed, 14 Mar 2001, Szabolcs Szakacsits wrote:
> > read() doesn't really work for this purpose, it blocks way too many
> > times to be very annoying. When finally data arrives it's useless.
> Huh? Take code of your non-blocking syscall. Make it ->read() for
> relevant file on /proc or wherever else you want it. See read() not
> blocking...

Sorry, I should have quoted "blocks". The problem isn't blocking but
*no* data, no information. In the end you can only conclude that you
know *nothing* about what happened in the last time interval t, and
this can be seconds, even minutes, even with an RT, mlocked, etc.
process when the load is around 0.

Szaka




Re: system call for process information?

2001-03-14 Thread Szabolcs Szakacsits


On Mon, 12 Mar 2001, Alexander Viro wrote:
> On Mon, 12 Mar 2001, Guennadi Liakhovetski wrote:
> > I need to collect some info on processes. One way is to read /proc
> > tree. But isn't there a system call (ioctl) for this? And what are those
> Occam's Razor.  Why invent new syscall when read() works?

read() doesn't really work for this purpose; it blocks far too often,
which is very annoying. When the data finally arrives, it's useless.

Szaka




Re: Linux Disk Performance/File IO per process

2001-01-29 Thread Szabolcs Szakacsits


On Mon, 29 Jan 2001, Chris Evans wrote:

> Stephen Tweedie has a rather funky i/o stats enhancement patch which
> should provide what you need. It comes with RedHat7.0 and gives decent
> disk statistics in /proc/partitions.

Monitoring via /proc [not just IO but close to anything] has these
problems:
 - it is slow, not atomic, not scalable
 - if the kernel decides, explicitly or due to a "bug", to refuse doing
   IO, you get something like this [even using a mlocked, RT monitor]:
   procs          memory      swap        io     system       cpu
 r  b  w   swpd  free  buff  cache  si    so    bi    bo    in    cs  us  sy  id
 0  1  1  27116  1048   736 152832 128  1972  2544   869   44  1812   2  43  55
 5  0  2  27768  1048   744 153372  52  1308  2668   777   43  1772   2  61  37
 0  2  1  28360  1048   752 153900 332   564  2311   955   49  2081   1  68  31
frozen
 1  7  2  28356  1048   752 153708 3936    0  2175 29091  494 27348   0   1  99
 1  0  2  28356  1048   792 153656 172     0  7166     0  144   838   4  17  80

In short, monitoring via /proc is unreliable.

Szaka




Re: 2.4.1pre8 slowdown on dbench tests

2001-01-18 Thread Szabolcs Szakacsits


On Fri, 19 Jan 2001, Jens Axboe wrote:
> On Fri, Jan 19 2001, Szabolcs Szakacsits wrote:
> > Redone with big enough swap by requests.
> > 2.4.0,132MB swap
> > 548.81user 128.97system11:22  99%CPU (442433major+705419minor)
> > 561.12user 171.06system12:29  97%CPU (446949major+712525minor)
> > 625.68user 2833.29system 1:12:38  79%CPU (638957major+1463974minor)
> > ===
> > 2.4.1pre8,132MB swap
> > 548.71user 117.93system11:09  99%CPU (442434major+705420minor)
> > 558.93user 166.82system12:20  98%CPU (446941major+712662minor)
> > 621.37user 2592.54system 1:07:33  79%CPU (592679major+1311442minor)
>
> Better, could you try with the number changes that Andrea suggested
> too? Thanks.

It helped intensive swapping a bit, but degraded the other cases [no
or slight swapping].

2.4.1pre8,32MB RAM,132MB swap,blk suggestion
544.19user 141.25system11:31  99%CPU (442419major+705411minor)
554.83user 191.57system12:41  98%CPU (445762major+710409minor)
612.05user 2551.37system 1:07:21  78%CPU (589623major+1313665minor)




Re: 2.4.1pre8 slowdown on dbench tests

2001-01-18 Thread Szabolcs Szakacsits


Redone with a big enough swap, by request.

2.4.0,132MB swap
548.81user 128.97system11:22  99%CPU (442433major+705419minor)
561.12user 171.06system12:29  97%CPU (446949major+712525minor)
625.68user 2833.29system 1:12:38  79%CPU (638957major+1463974minor)
===
2.4.1pre8,132MB swap
548.71user 117.93system11:09  99%CPU (442434major+705420minor)
558.93user 166.82system12:20  98%CPU (446941major+712662minor)
621.37user 2592.54system 1:07:33  79%CPU (592679major+1311442minor)

> Below some kernel compile numbers on a 32 MB RAM + 32 MB swap box. The
> three lines mean compilation with the -j1, -j2 and -j4 option. Most of
> the time 2.4.1pre8 was also unable to compile the kernel because cc1
> was killed by OOM handler.
>
> 2.2.18
> 548.27user 94.18system 10:50  98%CPU (450479major+696869minor)
> 548.94user 153.85system11:51  98%CPU (487111major+704948minor)
> 599.44user 2018.66system   51:47  84%CPU (2295045major+1182819minor)
> =
> 2.4.0
> 557.18user 121.57system11:25  99%CPU (442434major+705429minor)
> 551.76user 158.78system12:11  97%CPU (446183major+711572minor)
> 579.65user 2860.53system 1:05:45  87%CPU (650964major+1209969minor)
> ===
> 2.4.0+blk-13B
> 546.89user 140.35system11:33  99%CPU (442435major+705424minor)
> 570.73user 188.51system12:56  97%CPU (445171major+712791minor)
> 566.33user 2681.20system 1:02:26  86%CPU (654402major+1225784minor)
> =
> 2.4.1pre8
> 546.23user 118.81system11:09  99%CPU (442434major+705424minor)
> 569.12user 161.25system12:22  98%CPU (446667major+712457minor)
> 727.58user 2489.96system 1:25:34  62%CPU (616240major+1375321minor)




Re: 2.4.1pre8 slowdown on dbench tests

2001-01-18 Thread Szabolcs Szakacsits


On Thu, 18 Jan 2001, Marcelo Tosatti wrote:

> On my dbench runs I've noted a slowdown between pre4 and pre8 with 48
> threads. (128MB, 2 CPU's machine)

Below are some kernel compile numbers from a 32 MB RAM + 32 MB swap
box. The three lines are compilations with the -j1, -j2 and -j4
options. Most of the time 2.4.1pre8 was also unable to compile the
kernel, because cc1 was killed by the OOM handler.

Szaka

2.2.18
548.27user 94.18system 10:50  98%CPU (450479major+696869minor)
548.94user 153.85system11:51  98%CPU (487111major+704948minor)
599.44user 2018.66system   51:47  84%CPU (2295045major+1182819minor)
=
2.4.0
557.18user 121.57system11:25  99%CPU (442434major+705429minor)
551.76user 158.78system12:11  97%CPU (446183major+711572minor)
579.65user 2860.53system 1:05:45  87%CPU (650964major+1209969minor)
===
2.4.0+blk-13B
546.89user 140.35system11:33  99%CPU (442435major+705424minor)
570.73user 188.51system12:56  97%CPU (445171major+712791minor)
566.33user 2681.20system 1:02:26  86%CPU (654402major+1225784minor)
=
2.4.1pre8
546.23user 118.81system11:09  99%CPU (442434major+705424minor)
569.12user 161.25system12:22  98%CPU (446667major+712457minor)
727.58user 2489.96system 1:25:34  62%CPU (616240major+1375321minor)





Re: Subtle MM bug (really 830MB barrier question)

2001-01-09 Thread Szabolcs Szakacsits


On Tue, 9 Jan 2001, Dan Maas wrote:

> OK it's fairly obvious what's happening here. Your program is using
> its own allocator, which relies solely on brk() to obtain more
> memory.
[... good explanation here ...]
> Here's your short answer: ask the authors of your program to either
> 1) replace their custom allocator with regular malloc() or 2) enhance
> their custom allocator to use mmap. (or, buy some 64-bit hardware =)...)

3) ask the kernel developers to get rid of this "brk hits the fixed
start address of mmapped areas" limitation [or its converse complaint,
"the mmapped area should start at a lower address"]. E.g. Solaris does
a growing-up heap, a growing-down mmap area and a fixed-size stack at
the top.

Wayne, the patch below should fix your barrier problem [1 GB physical
memory configuration]; I have used it only with 2.2 kernels. Your app
should complain about out of memory at around 2.7 GB
(0xb0000000-0x08??), but note that only 256 MB (0xc0000000-0xb0000000)
are left for shared libraries and mmapped areas.

Good luck,

Szaka

--- linux-2.2.18/include/asm-i386/processor.h  Thu Dec 14 08:20:17 2000
+++ linux/include/asm-i386/processor.h  Tue Jan  9 17:50:49 2001
@@ -166,7 +166,7 @@
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE (TASK_SIZE / 3)
+#define TASK_UNMAPPED_BASE 0xb0000000

 /*
  * Size of io_bitmap in longwords: 32 is ports 0-0x3ff.





Re: Subtle MM bug

2001-01-08 Thread Szabolcs Szakacsits


Andi Kleen <[EMAIL PROTECTED]> wrote:
> On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> > package called MAGMA; at times this requires very large matrices. The
> > RSS can get up to 870MB; for some reason a MAGMA process under linux
> > thinks it has run out of memory at 870MB, regardless of the actual
> > memory/swap in the machine. MAGMA is single-threaded.
> I think it's caused by the way malloc maps its memory.
> Newer glibc should work a bit better by falling back to mmap even
> for smaller allocations (older does it only for very big ones)

AFAIK "newer glibc" means CVS glibc, but the malloc() tuning parameters
work via environment variables for the current stable releases as
well; e.g. to overcome the above "out of memory" one could do:
% export MALLOC_MMAP_MAX_=100
% export MALLOC_MMAP_THRESHOLD_=0
% magma

By default, on 32-bit Linux the current stable glibc malloc uses brk
between 0x08??-0x40000000, plus at most 128 (MALLOC_MMAP_MAX_) mmaps,
used only when the requested chunk is greater than 128 kB
(MALLOC_MMAP_THRESHOLD_). If MAGMA mallocs memory in chunks smaller
than 128 kB, then the above out-of-memory behaviour is expected.

Szaka




[PATCH-2] Re: NR_RESERVED_FILES broken in 2.4 too

2000-12-11 Thread Szabolcs Szakacsits


On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> On Sun, 10 Dec 2000, Szabolcs Szakacsits wrote:
> > - this comment from include/linux/fs.h should be deleted
> >   #define NR_RESERVED_FILES 10 /* reserved for root */
> well, not really -- it is "reserved" right now too, it is just root is
> allowed to use up all the reserved entries in the beginning and then when
> the normal user uses up all the "non-reserved" ones (from slab
> cache) there would be nothing left for the root.

And what real functionality does this provide? Close to nada. This is
why I told you that if you are right then it's useless. So I think
this is a bug that was introduced accidentally, overlooking the
NR_RESERVED_FILES functionality when get_empty_filp was rewritten to
use the slab.

> But let us not argue about the above definition of "reserved" -- that is
> not productive.

Agree, this is why I made the patch ;) Also, this stupid
misunderstanding and waste of time between us is a *very* typical
example of the result of the super inferior Linux kernel source code
management. There is no way to dig up who dropped the reserved file
functionality about three years ago, or why. "Hidden", unexplained
patches slip in with almost every patch-set. Some developers think
they can save a huge amount of time by this "model"; they just ignore
other developers and support people who need to understand what, when,
why and by whom a change happened. And because of the lack of enough
information [look, both of us have and I think understand the code,
still we don't agree] the end result is that, apparently, by now too
many times the ball is dropped back to these developers, who get
buried by even more work. This is just one sign Linux has a hard
future, and unfortunately there are others ... In general Linux is
still one of the best today, but without addressing and solving the
current development problems it will not be true after a couple of
years. Linux will remain just another Unix and lose 1:100 to another
OS. The source is with us but it should be used properly ...

> Let's do something productive -- namely, take your idea to
> the next logical step. Since you have proven that the freelist mechanism
> or concept of "reserve file structures" is not 100% satisfactory as is

This is also a difference between us. You look at the problem from a
theoretical point of view, saying it's not 100%; I consider it from a
practical point of view and say it gives close to 0% functionality for
users.

> then how about removing the freelist altogether? I.e. what about serving

I'm fine with the current implementation and more interested in bug
fixes. There could be one reason against the patch: performance. The
patch below has the same fix, and TUX will give exactly the same
numbers [the get_empty_filp code remains ugly but at least fast].

Szaka

diff -ur linux-2.4.0-test12-pre7/fs/file_table.c linux/fs/file_table.c
--- linux-2.4.0-test12-pre7/fs/file_table.c Fri Dec  8 08:17:12 2000
+++ linux/fs/file_table.c   Mon Dec 11 10:40:41 2000
@@ -57,7 +57,9 @@
/*
 * Allocate a new one if we're below the limit.
 */
-   if (files_stat.nr_files < files_stat.max_files) {
+   if ((files_stat.nr_files < files_stat.max_files) && (!current->euid ||
+NR_RESERVED_FILES - files_stat.nr_free_files <
+files_stat.max_files - files_stat.nr_files)) {
file_list_unlock();
f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
file_list_lock();
diff -ur linux-2.4.0-test12-pre7/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.4.0-test12-pre7/include/linux/fs.h  Fri Dec  8 15:06:55 2000
+++ linux/include/linux/fs.hSun Dec 10 17:37:52 2000
@@ -57,7 +57,7 @@
 extern int leases_enable, dir_notify_enable, lease_break_time;

 #define NR_FILE  8192  /* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10 /* reserved for root */
+#define NR_RESERVED_FILES 128 /* reserved for root */
 #define NR_SUPER 256

 #define MAY_EXEC 1




Re: [PATCH] NR_RESERVED_FILES broken in 2.4 too

2000-12-10 Thread Szabolcs Szakacsits


On Sun, 10 Dec 2000, Tigran Aivazian wrote:

> If, however, you believe that the above _is_ the case but it should _not_
> happen then you are proposing a completely new policy of file structure
> allocation which you believe is superior. It is quite possible so let's
> all understand your new policy and let Linus decide whether it's better
> than the existing one. But if so, don't tell me you are fixing a bug
> because it is not a bug -- it's a redesign of file structure allocator.

If it's not a bug then

- this comment from include/linux/fs.h should be deleted
  #define NR_RESERVED_FILES 10 /* reserved for root */
- books should be updated
- people's mind also who believe kernel reserves fd's for superuser

The kernel has played the lottery in this regard since 2.1. And this
would be another sad fact: the kernel is extremely poor *out of the
box* in regards to security and reliability ...

Szaka





Re: [PATCH] NR_RESERVED_FILES broken in 2.4 too

2000-12-10 Thread Szabolcs Szakacsits


On Sun, 10 Dec 2000, Tigran Aivazian wrote:

> problem (e.g. you mentioned something about allocating more than NR_FILES
> on SMP -- what do you mean?) which you are not explaining clearly.

E.g. this situation: only one file struct is left for allocation. One
CPU goes into get_empty_filp and unlocks file_list before
kmem_cache_alloc; another CPU also gets into get_empty_filp, takes
file_list_lock at the top and goes down the same path. The end result
can be that both increase nr_files instead of only one. But I don't
think it's a big issue at *present* that could cause any problems ...

> You just say "it is broken and here is the patch" but that, imho, is not
> enough. (ok, one could overcome the laziness and actually _read_ your
> patch to see what you _think_ is broken but surely it is better if you
> explain it yourself?).

Sorry I didn't explain; I thought it was short enough and
significantly faster to understand by reading the code than my poor
English ;)

Szaka




Re: [PATCH] NR_RESERVED_FILES broken in 2.4 too

2000-12-10 Thread Szabolcs Szakacsits


On Sun, 10 Dec 2000, Tigran Aivazian wrote:

> > user% ./fd-exhaustion   # e.g. while(1) open("/dev/null",...);
> > root# cat /proc/sys/fs/file-nr
> > cat: /proc/sys/fs/file-nr: Too many open files in system
> >
> > The above happens even with increased NR_RESERVED_FILES to 96 [no
> > wonder, get_empty_filp is broken].
>
> no, it is not broken. But your experiment is broken. Don't do cat file-nr
> but compile this C program

Ok, now I understand why you can't see the problem ;) You look up the
values in user space but I did it [additionally] in kernel space [also
I think I understand what happens ;)]. I guess with the code below you
claim I shouldn't see values like this when file struct allocations
are started by user apps,
1024 0 1024

Or that 0 shouldn't be between 0 and NR_RESERVED_FILES. Right? Wrong.
I saw it happen; you can reproduce it if you look up the
nr_free_files value, allocate that many as root, don't release them,
and immediately after that start to allocate fd's with a user app.
Note, if you have already hit nr_files = max_files you won't ever be
able to reproduce the above - but this is only a half solution; kernel
2.0 was fine, get_empty_filp was broken somewhere between 2.0 and 2.1
and it's still broken. With the patch the functionality is back and it
also works the way the authors of the book mentioned believe ;)

It's quite funny, because before this I was also told it was broken
but I couldn't believe it, so I looked at the code and tested it; the
report was right ...

Still disagree? ;)

Szaka

> #include <sys/types.h>
> #include <sys/stat.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
> int fd, len;
> static char buf[2048];
>
> fd = open("/proc/sys/fs/file-nr", O_RDONLY);
> if (fd == -1) {
> perror("open");
> exit(1);
> }
> while (1) {
> len = read(fd, buf, 1024);
> printf("len=%d %s", len, buf);
> lseek(fd, 0, SEEK_SET);
> sleep(1);
> }
> return 0;
> }
>
> and leave it running while doing experiments on the other console. You
> will see that everything is fine -- there is no bug. No wonder you saw the
> bug -- you ignored my 4 emails telling you otherwise :)
>
> Regards,
> Tigran
>
>




Re: 2.2.18pre25: VM: do_try_to_free_pages failed for

2000-12-10 Thread Szabolcs Szakacsits


thunder7 wrote:
> for almost everything:
>Dec 10 13:33:47 middle kernel: VM: do_try_to_free_pages failed for kswapd...
[]
> <tried to log in over the network, didn't work, pressed C-A-D and
>  watched fsck mull over 60+ Gb>

You could try out my patch that "reserves" virtual memory for root, so
you should be able to login/ssh and clean up if your "faulty" or
memory hungry daemons aren't run by root -- it works fine for me
and I didn't get negative feedback so far:
http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_vm+oom_killer-5.diff
More on the patch,
http://boudicca.tux.org/hypermail/linux-kernel/2000week48/0624.html

> Most messages I was able to dig up about this mentioned 2.2.17 and
> suggested upgrading to 2.2.18pre. I didn't think there is anything
> changed between 2.2.18pre25 and 2.2.18pre26(2.2.18 to be) in VM
> handling, so the problem still seems to persist. What are the
> suggestions? Moving to 2.4 is not possible, since the isdn
> compression module isdn_lzscomp.o won't work in 2.4.

Andrea Arcangeli's VM global patch got good feedback and according to
Alan Cox it's a potential candidate for 2.2.19,
ftp://ftp.nl.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre18/VM-global-2.2.18pre18-7.bz2

Good luck,

Szaka




[PATCH] NR_RESERVED_FILES broken in 2.4 too

2000-12-10 Thread Szabolcs Szakacsits


On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> > On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > > Read the whole get_empty_filp function, especially this part,
> > I have read the whole function, including the above code, of course. The
> > new_one label has nothing to do with freelists -- it adds the file to the
> > anon_list, where the new arrivales from the slab cache go. The goto
> > new_one above is there simply to initialize the structure with sane
> > initial values
> OK, 2.2 has put_inuse(f); instead of putting it to anon_list, so 2.4
> seems ok.

Back to common sense ;) Nevertheless, in addition to what you wrote,
get_empty_filp returns an allocated file struct that gets used.
So, ignoring your four emails arguing the kernel is OK, I downloaded
2.4-test11-pre7 and tried it out.

root# echo  1024 > /proc/sys/fs/file-max

Unpatched kernel,

user% ./fd-exhaustion   # e.g. while(1) open("/dev/null",...);
root# cat /proc/sys/fs/file-nr
cat: /proc/sys/fs/file-nr: Too many open files in system

The above happens even with increased NR_RESERVED_FILES to 96 [no
wonder, get_empty_filp is broken].

With the patch below,

user% ./fd-exhaustion
root# cat /proc/sys/fs/file-nr
946 0   1024
 or
1024   78   1024
 or
something that also works

The patch also has a fix not to allocate potentially more file
structs than NR_FILES on SMP.

Unfortunately NR_RESERVED_FILES needs to be increased to be useful
[i.e. e.g. to make ssh|login+ps|kill work for the superuser]. Another
way would be to free unused file structs more aggressively when the
kernel is short on free fd's.

> > There are even books (Understanding the Linux
> > Kernel by Bovet et all) which describe this freelist in the
> > current context so your patch will require updates to the books.

Checked this part of the book, ok for 2.0 but not for 2.[24].

Szaka

diff -ur linux-2.4.0-test12-pre7/fs/file_table.c linux/fs/file_table.c
--- linux-2.4.0-test12-pre7/fs/file_table.c Fri Dec  8 08:17:12 2000
+++ linux/fs/file_table.c   Sun Dec 10 17:05:55 2000
@@ -32,39 +32,36 @@
 {
static int old_max = 0;
struct file * f;
+   int total_free;

file_list_lock();
-   if (files_stat.nr_free_files > NR_RESERVED_FILES) {
-   used_one:
-   f = list_entry(free_list.next, struct file, f_list);
-   list_del(&f->f_list);
-   files_stat.nr_free_files--;
-   new_one:
-   memset(f, 0, sizeof(*f));
-   atomic_set(&f->f_count,1);
-   f->f_version = ++event;
-   f->f_uid = current->fsuid;
-   f->f_gid = current->fsgid;
-   list_add(&f->f_list, &anon_list);
-   file_list_unlock();
-   return f;
-   }
-   /*
-* Use a reserved one if we're the superuser
-*/
-   if (files_stat.nr_free_files && !current->euid)
-   goto used_one;
-   /*
-* Allocate a new one if we're below the limit.
-*/
-   if (files_stat.nr_files < files_stat.max_files) {
+   total_free = files_stat.max_files - files_stat.nr_files + 
+files_stat.nr_free_files;
+   if (total_free > NR_RESERVED_FILES || (total_free && !current->euid)) {
+   if (files_stat.nr_free_files) {
+   /* used_one */
+   f = list_entry(free_list.next, struct file, f_list);
+   list_del(&f->f_list);
+   files_stat.nr_free_files--;
+   new_one:
+   memset(f, 0, sizeof(*f));
+   atomic_set(&f->f_count,1);
+   f->f_version = ++event;
+   f->f_uid = current->fsuid;
+   f->f_gid = current->fsgid;
+   list_add(&f->f_list, &anon_list);
+   file_list_unlock();
+   return f;
+   }
+   /*
+* Allocate a new one if we're below the limit.
+*/
+   files_stat.nr_files++;
file_list_unlock();
f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
file_list_lock();
-   if (f) {
-   files_stat.nr_files++;
+   if (f)
goto new_one;
-   }
+   files_stat.nr_files--;
/* Big problems... */
printk("VFS: filp allocation failed\n");

diff -ur linux-2.4.0-test12-pre7/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.4.0-test12-pre7/include/linux/fs.h  Fri Dec  8 15:06:55 2000
+++ linux/include/linux/fs.hSun Dec 10 17:37:52 2000
@@ -57,7 +57,7 @@
 extern int leases_enable, dir_notify_enable, lease_break_time;

 #define NR_FILE  8192  /* this can well be larger on a larger system */




Re: [PATCH] Broken NR_RESERVED_FILES

2000-12-07 Thread Szabolcs Szakacsits


On Thu, 7 Dec 2000, Tigran Aivazian wrote:

> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > Read the whole get_empty_filp function, especially this part, note the
> > goto new_one below and the part you didn't include above [from
> > the new_one label],
> >
> > if (files_stat.nr_files < files_stat.max_files) {
> > file_list_unlock();
> > f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
> > file_list_lock();
> > if (f) {
> > files_stat.nr_files++;
> > goto new_one;
> > }
>
> I have read the whole function, including the above code, of course. The
> new_one label has nothing to do with freelists -- it adds the file to the
> anon_list, where the new arrivals from the slab cache go. The goto
> new_one above is there simply to initialize the structure with sane
> initial values

OK, 2.2 has

put_inuse(f);

instead of putting it to anon_list, so 2.4 seems ok.

Szaka

> So, the normal user _cannot_ take a file structure from the freelist
> unless it contains more than NR_RESERVED_FILE entries. Please read the
> whole function and see it for yourself.
>
> Regards,
> Tigran
>
>




Re: [PATCH] Broken NR_RESERVED_FILES

2000-12-07 Thread Szabolcs Szakacsits


On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > again. The failed logic is also clear from the kernel code [user
> > happily allocates when freelist < NR_RESERVED_FILES].
>
> is it clear to you? it is not clear to me, or rather the opposite seems
> clear. This is what the code looks like (in 2.4):
>
> struct file * get_empty_filp(void)
> {
> static int old_max = 0;
> struct file * f;
>
> file_list_lock();
> if (files_stat.nr_free_files > NR_RESERVED_FILES) {
> used_one:
> f = list_entry(free_list.next, struct file, f_list);
> list_del(&f->f_list);
> files_stat.nr_free_files--;
>
> so, a normal user is only allowed to allocate from the freelist when the
> number of elements on the freelist is > NR_RESERVED_FILES. I do not see
> how you are able to take elements from the freelist when the number is <
> NR_RESERVED_FILES unless you are a super-user, i.e. current->euid == 0.

Read the whole get_empty_filp function, especially this part, note the
goto new_one below and the part you didn't include above [from
the new_one label],

if (files_stat.nr_files < files_stat.max_files) {
file_list_unlock();
f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
file_list_lock();
if (f) {
files_stat.nr_files++;
goto new_one;
}

> Btw, while you are there (in 2.2 kernel) you may want to fix the

Sorry, no time.

Szaka




Re: [PATCH] Broken NR_RESERVED_FILES

2000-12-07 Thread Szabolcs Szakacsits


On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > Reserved fd's for superuser doesn't work.
> It does actually work,

What do you mean by "work"? I meant that user apps are able to
exhaust fd's completely so none are left for the superuser.

> but remember that the concept of "reserved file
> structures for superuser" is defined as "file structures to be taken from
> the freelist"

Yes, in this sense it works, and it's also very close to useless.

> whereas your patch below:
[...]
> allows one to allocate a file structure from the filp_cache slab cache if
> one is a superuser.

Or one is a user who hasn't yet hit the reserved fd's (and of course
the superuser isn't able to allocate more than max_files).

Szaka




[PATCH] Broken NR_RESERVED_FILES

2000-12-07 Thread Szabolcs Szakacsits


Reserved fd's for the superuser don't work. A patch for 2.2 is below;
kernel 2.4.x also has this problem, and the fix is similar. The
default NR_RESERVED_FILES value also had to be increased (e.g. ssh and
login need 36, ls 16, man 45 fd's, etc).

BTW, I have an updated version of my reserved VM for superuser +
improved/fixed version of Rik's out of memory killer patch for 2.2
here,

http://mlf.linux.rulez.org/mlf/ezaz/reserved_root_vm+oom_killer-5.diff

It fixes the potential deadlock when kernel threads were blocked to
try to free pages - more details about the patch are in a former
email, http://boudicca.tux.org/hypermail/linux-kernel/2000week48/0624.html

Szaka

diff -ur linux-2.2.18pre21/fs/file_table.c linux/fs/file_table.c
--- linux-2.2.18pre21/fs/file_table.c   Tue Jan  4 13:12:23 2000
+++ linux/fs/file_table.c   Thu Dec  7 13:26:06 2000
@@ -71,30 +71,27 @@
 {
static int old_max = 0;
struct file * f;
+   int total_free;

-   if (nr_free_files > NR_RESERVED_FILES) {
-   used_one:
-   f = free_filps;
-   remove_filp(f);
-   nr_free_files--;
-   new_one:
-   memset(f, 0, sizeof(*f));
-   f->f_count = 1;
-   f->f_version = ++global_event;
-   f->f_uid = current->fsuid;
-   f->f_gid = current->fsgid;
-   put_inuse(f);
-   return f;
-   }
-   /*
-* Use a reserved one if we're the superuser
-*/
-   if (nr_free_files && !current->euid)
-   goto used_one;
-   /*
-* Allocate a new one if we're below the limit.
-*/
-   if (nr_files < max_files) {
+   total_free = max_files - nr_files + nr_free_files;
+   if (total_free > NR_RESERVED_FILES || (total_free && !current->euid)) {
+   if (nr_free_files) {
+   used_one:
+   f = free_filps;
+   remove_filp(f);
+   nr_free_files--;
+   new_one:
+   memset(f, 0, sizeof(*f));
+   f->f_count = 1;
+   f->f_version = ++global_event;
+   f->f_uid = current->fsuid;
+   f->f_gid = current->fsgid;
+   put_inuse(f);
+   return f;
+   }
+   /*
+* Allocate a new one if we're below the limit.
+   */
f = kmem_cache_alloc(filp_cache, SLAB_KERNEL);
if (f) {
nr_files++;
diff -ur linux-2.2.18pre21/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.2.18pre21/include/linux/fs.hThu Nov  9 08:20:18 2000
+++ linux/include/linux/fs.hThu Dec  7 11:10:50 2000
@@ -51,7 +51,7 @@
 extern int max_super_blocks, nr_super_blocks;

 #define NR_FILE  4096  /* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10 /* reserved for root */
+#define NR_RESERVED_FILES 96 /* reserved for root */
 #define NR_SUPER 256

 #define MAY_EXEC 1

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/






Re: [PATCH] Broken NR_RESERVED_FILES

2000-12-07 Thread Szabolcs Szakacsits


On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > Reserved fd's for superuser doesn't work.
> It does actually work,

What do you mean by "work"? I meant that user apps are able to
exhaust fd's completely so that none is left for the superuser.

> but remember that the concept of "reserved file
> structures for superuser" is defined as "file structures to be taken from
> the freelist"

Yes, in this sense it works, and it's also very close to useless.

> whereas your patch below:
[...]
> allows one to allocate a file structure from the filp_cache slab cache if
> one is a superuser.

Or one is a user who hasn't yet hit the reserved fd's (and of course
the superuser isn't able to allocate more than max_files).

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] Broken NR_RESERVED_FILES

2000-12-07 Thread Szabolcs Szakacsits


On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > again. The failed logic is also clear from the kernel code [user
> > happily allocates when freelist > NR_RESERVED_FILES].
>
> is it clear to you? it is not clear to me, or rather the opposite seems
> clear. This is what the code looks like (in 2.4):
>
> struct file * get_empty_filp(void)
> {
>         static int old_max = 0;
>         struct file * f;
>
>         file_list_lock();
>         if (files_stat.nr_free_files > NR_RESERVED_FILES) {
>         used_one:
>                 f = list_entry(free_list.next, struct file, f_list);
>                 list_del(&f->f_list);
>                 files_stat.nr_free_files--;
>
> so, a normal user is only allowed to allocate from the freelist when the
> number of elements on the freelist is > NR_RESERVED_FILES. I do not see
> how you are able to take elements from the freelist when the number is <=
> NR_RESERVED_FILES unless you are a super-user, i.e. current->euid == 0.

Read the whole get_empty_filp function, especially this part, note the
goto new_one below and the part you didn't include above [from
the new_one label],

if (files_stat.nr_files < files_stat.max_files) {
        file_list_unlock();
        f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
        file_list_lock();
        if (f) {
                files_stat.nr_files++;
                goto new_one;
        }

> Btw, while you are there (in 2.2 kernel) you may want to fix the

Sorry, no time.

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] Broken NR_RESERVED_FILES

2000-12-07 Thread Szabolcs Szakacsits


On Thu, 7 Dec 2000, Tigran Aivazian wrote:

> On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > Read the whole get_empty_filp function, especially this part, note the
> > goto new_one below and the part you didn't include above [from
> > the new_one label],
> >
> > if (files_stat.nr_files < files_stat.max_files) {
> >         file_list_unlock();
> >         f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
> >         file_list_lock();
> >         if (f) {
> >                 files_stat.nr_files++;
> >                 goto new_one;
> >         }

> I have read the whole function, including the above code, of course. The
> new_one label has nothing to do with freelists -- it adds the file to the
> anon_list, where the new arrivals from the slab cache go. The goto
> new_one above is there simply to initialize the structure with sane
> initial values

OK, 2.2 has

put_inuse(f);

instead of putting it on the anon_list, so 2.4 seems OK.

Szaka

> So, the normal user _cannot_ take a file structure from the freelist
> unless it contains more than NR_RESERVED_FILES entries. Please read the
> whole function and see it for yourself.
>
> Regards,
> Tigran



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] Reserved root VM + OOM killer

2000-11-24 Thread Szabolcs Szakacsits


On Thu, 23 Nov 2000, Pavel Machek wrote:

> > HOW?
> > No performance loss, RAM is always fully utilized (except if no swap),
>
> Handheld machines never have any swap, and alwys have little RAM [trust me,
> velo1 I'm writing this on is so tuned that 100KB les and machine is useless].
>  Unless reservation  can be turned off, it is not acceptable. Okay, it can
> be tuned. Ok, then.
>
> [What about making default reserved space 10% of *swap* size?]

No. Many people use no swap even if they have plenty of RAM. I wasn't
right when I wrote that the "reserved" VM is on swap or in the
buffer/page cache. I meant to write that the reserved VM is unused swap
and/or is *used* as buffer/page cache until it's needed by root. Leave
swap out of the former sentence and you get that no RAM is wasted at all ;)

Moreover, the default value for boxes with less than 8 MB is 0 pages (I
thought about "embedded" systems), it's 5 MB if the box has more than
100 MB, and between 8 MB and 100 MB it's 5% of the RAM, counted as part
of the VM. I found in my setup that at least 4 MB is needed to be
useful if root wants to act safely. Of course this can be different with
other setups and application behaviours -- this is why it can be tuned
at runtime. Using more "reserved" [this is really a stupid and not
accurate name] VM definitely helps :) BTW, apparently Solaris reserves
4 MB for root.

I also thought about making it a compile-time option [for people using
Linux in embedded systems]; in that case you would have less than a 25%
chance to save one page -- I would instead optimize the compiler ;)
But maybe embedded systems use non-overcommittable memory
handling; I didn't look at how they handle OOM.

I'm afraid I was also wrong about performance; here is a typical case
of how a standard 2.2 kernel works when OOM happens: killing gpm, vmstat,
syslogd, tail, httpd, zsh, identd, httpd, klogd, httpd, httpd, httpd
[the main httpd, web is dead], bad_app. If there are more bad_apps
[working on the same problem but e.g. fed wrong input,
etc], then there is a big chance you must hit the reset button. With
Rik's OOM killer, the "right" processes are killed, but I found the
system thrashes too long, and because of the constant memory pressure
you still must hit the reset button. With my patch + fixes of Rik's
OOM killer, the "right" processes are killed fast [it's done only in
page fault, contrary to 2.4.0-test11 which has two OOM killers: one in
page fault and Rik's one ... pretty ugly] and you can do whatever you
want as root. It would be nice to see which one of the three cases
would finish a job first where multiple processes [not threads] work
on the same job, saving the partial results and constantly producing
OOM.

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/






Re: [PATCH] Reserved root VM + OOM killer

2000-11-22 Thread Szabolcs Szakacsits


On Wed, 22 Nov 2000, Rik van Riel wrote:

> On Wed, 22 Nov 2000, Szabolcs Szakacsits wrote:
>
> >- OOM killing takes place only in do_page_fault() [no two places in
> > the kernel for process killing]
>
> ... disable OOM killing for non-x86 architectures.
> This doesn't seem like a smart move ;)
>
> > diff -urw linux-2.2.18pre21/arch/i386/mm/Makefile linux/arch/i386/mm/Makefile
> > --- linux-2.2.18pre21/arch/i386/mm/Makefile Fri Nov  1 04:56:43 1996
  ^
As I wrote, the OOM killer changes are x86 only at present. Other
arch's still use the default OOM killing defined in arch/*/mm/fault.c.

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[PATCH] Reserved root VM + OOM killer

2000-11-22 Thread Szabolcs Szakacsits



WHY?
Permanent memory demand by user apps makes Linux uncontrollable in an
OOM (out of memory) situation, when the OOM killer can't kill as fast
as the memory is needed (and your superb 'free memory space'
monitor/actor developed over the last 5 years was also killed, and init
couldn't restart it because of OOM). In the Unix world it's common good
practice to reserve resources for root (see e.g. disk space, network
ports, file descriptors, processes, etc). Linux doesn't reserve
virtual memory for root, so if OOM is caused by user apps you get these
kinds of messages as root when trying to make the system work again
properly or investigate what happened:

running a command from prompt:
  Memory exhausted
  Segmentation fault
  fork failed: resource temporarily unavailable
trying to login from console:
  Unable to load interpreter /lib/ld-linux.so.2
  error while loading shared libraries: libc.so.6:
  cannot map zero-fill pages: Cannot allocate memory
  error while loading shared libraries: libtermcap.so.2: failed
  to map segment from shared object: Cannot allocate memory
  xrealloc: cannot reallocate 128 bytes (0 bytes allocated)
  xmalloc: cannot allocate 562 bytes (0 bytes allocated)
trying to ssh via network:
  Received disconnect: Command terminated on signal 11.

WHAT?
This patch tries to reserve virtual memory for root, balances memory
usage between root and user apps if memory is overcommitted, and
includes Rik's OOM killer, which is much more clever about what to kill
when OOM happens than what's included in standard 2.2 kernels.

HOW?
No performance loss; RAM is always fully utilized (except if there is
no swap), and the tunable reserved memory stays on swap (or in caches)
until it's needed by root. There are two scenarios. When user apps
don't overcommit memory they will see only
  UVM = (real virtual memory) - (reserved virtual memory for root)
If memory is overcommitted then user apps will also use the reserved
memory (otherwise there would be a performance loss, I guess) but
the kernel will try hard to push them back below UVM.

IN THE PATCH:
 - reserved VM for root
 - Rik's OOM killer from 2.4.0-test11 with "fixes":
   - PID 1 never gets killed by OOM killer
   - OOM killing takes place only in do_page_fault() [no two places in
the kernel for process killing]
   - niced processes are not penalized
 - IPC shared mem can only be quasi-overcommitted (i.e. a request is
   successful only if there is enough VM at request time)

NOTES:
 - it's for 2.2 (late) kernels [tested with 2.2.18pre21, applies to
 2.2.18pre22 as well]
 - Intel only [page fault handling is implemented differently
 in different architectures, no common hooks but easy to fix]
 - SMP not tested
 - GUI environment not tested
 - tests were done with constant brk, mmap, zfod, cow, IPC shm fork
 bombs, mostly on a 64-128 MB RAM + 80 MB swap box.
 - using IPC shared mem still can "kill" the box (unused mem not
 freed). Use Solar Designer kernel security patch or set
 /proc/sys/kernel/shmall according to your VM
 - it's not for common fork bombs (use e.g. fair scheduler,
 Fork Bomb Defuser, etc against them). Use ulimit -u if you want
 to test the patch and don't have enough CPU power
 - the reserved virtual memory can be set at runtime via
 /proc/sys/vm/reserved. The value is in pages (4096 bytes on x86)
 - on SMP you should probably increase this value as a function
 of your CPUs
 - if you have GB's of VM you can experience malloc() scalability
 problems; use glibc 2.2, limit your VM, raise the limits
 via malloc environment variables, etc.

PROBLEMS:
 - if a killable task is constantly in TASK_UNINTERRUPTIBLE [e.g. because
 of network fs (smb, nfs, etc) problems] then the OOM killer won't
 work ... at least this is what I suspect
 - schedule() doesn't always immediately schedule the killable task
 - probably others I'm not aware of

Standard disclaimer applies. It worked fine for me, but maybe it will
eat your whole computer and pets :) It's not perfect, but it seems good
enough and I definitely found it much better than what is in 2.2
kernels. Of course your experience can be completely different. Please
let me know.

Szaka

diff -urw linux-2.2.18pre21/arch/i386/mm/Makefile linux/arch/i386/mm/Makefile
--- linux-2.2.18pre21/arch/i386/mm/Makefile Fri Nov  1 04:56:43 1996
+++ linux/arch/i386/mm/Makefile Tue Nov 21 03:03:15 2000
@@ -8,6 +8,6 @@
 # Note 2! The CFLAGS definition is now in the main makefile...

 O_TARGET := mm.o
-O_OBJS  := init.o fault.o ioremap.o extable.o
+O_OBJS  := init.o fault.o ioremap.o extable.o ../../../mm/oom_kill.o

 include $(TOPDIR)/Rules.make
diff -urw linux-2.2.18pre21/arch/i386/mm/fault.c linux/arch/i386/mm/fault.c
--- linux-2.2.18pre21/arch/i386/mm/fault.c  Wed May  3 20:16:31 2000
+++ linux/arch/i386/mm/fault.c  Tue Nov 21 05:49:36 2000
@@ -23,6 +23,7 @@
 #include <asm/hardirq.h>

 extern void die(const char *,struct pt_regs *,long);
+extern int oom_kill(void);

 /*
  * Ugly, ugly, 





RE: KPATCH] Reserve VM for root (was: Re: Looking for better VM)

2000-11-16 Thread Szabolcs Szakacsits


On Thu, 16 Nov 2000, Rik van Riel wrote:
> On Thu, 16 Nov 2000, Szabolcs Szakacsits wrote:
>   [snip exploit that really shouldn't take Linux down]

I don't really consider it an exploit. It's a kind of workload
optimized for fast testing, simulating many busy user daemons
(e.g. dynamically generating web pages). Everybody knows the Slashdot
effect. A system is designed for a workload according to a budget and
other factors. But as soon as the load gets *much* higher than was
ever expected, the system starts to thrash and nobody can log in or
start new processes. You can pull off the cable, but if it's a remote
box then you are really in a bad situation. Or if a local [e.g.
computing] batch job runs away you also must hit the reset button.
This happens so many times that it should really be taken seriously now;
whoever doesn't believe it should just search for typical OOM crash
patterns in user reports on different mailing lists/newsgroups.

> > This or something similar didn't kill the box [I've tried all local
> > DoS from Packetstorm that I could find]. Please send a working
> > example. Of course probably it's possible to trigger root owned
> > processes to eat memory eagerly by user apps but that's a problem in
> > the process design running as root and not a kernel issue.
> Not necessarily, but your patch will probably make a difference
> for quite a number of people...

Could you please explain what you mean? ;) I saw only ONE difference:
the system remains usable for root but not for anybody else. Everything
else is the same as before. Of course I think there are still problems
with the patch, but to be honest I don't know what they are ... except
those I wrote about before -- e.g. the latest, not yet released version
definitely doesn't work with your OOM killer [the system just thrashes].
Can you explain why you put process killing in do_try_to_free_pages()
instead of the original place, do_page_fault()? I guess putting it in
do_page_fault() [if possible] would fix my current problem.

> > If you think fork() kills the box then ulimit the maximum number
> > of user processes (ulimit -u). This is a different issue and a
> > bad design in the scheduler (see e.g. Tru64 for a better one).
> My fair scheduler catches this one just fine. It hasn't
> been integrated in the kernel yet, but both VA Linux and
> Conectiva use it in their kernel RPM.

I know about two fair schedulers for Linux, one of them is yours, but
I couldn't try them out yet. Anyway, definitely a must ;)

> While this is not one of the sexy new kernel
> features, this will help quite a few system
> administrators and is destined to a long and
> healthy life inside kernel RPMs, maybe even
> in the main kernel tree (when 2.5 splits?).

Thanks for the feedback,

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



RE: KPATCH] Reserve VM for root (was: Re: Looking for better VM)

2000-11-16 Thread Szabolcs Szakacsits


On Wed, 1 Jan 1997 [EMAIL PROTECTED] wrote:

>>main() { while(1) if (fork()) malloc(1); }
>>With the patch below I could ssh to the host and killall the offending
>>processes. To enable reserving VM space for root do
> what about main() { while(1) system("ftp localhost &"); }
> This, or something similar, should allow you to kill your machine
> even with your patch from a normal user account

This or something similar didn't kill the box [I've tried all local
DoS from Packetstorm that I could find]. Please send a working
example. Of course probably it's possible to trigger root owned
processes to eat memory eagerly by user apps but that's a problem in
the process design running as root and not a kernel issue.

Note, I'm not discussing "a local user can kill the box without limits";
I say Linux "deadlocks" [it starts its own autonomous life and usually
your only chance is to hit the reset button] when there is continuous
VM pressure from user applications. If you think fork() kills the box
then ulimit the maximum number of user processes (ulimit -u). This is
a different issue and a bad design in the scheduler (see e.g. Tru64
for a better one).

BTW, I have a new version of the patch with which Linux behaves much
better from root's point of view when memory is more significantly
overcommitted. I'll post it if I have time [and there is interest].

Szaka

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/






RE: KPATCH] Reserve VM for root (was: Re: Looking for better VM)

2000-11-16 Thread Szabolcs Szakacsits


On Thu, 16 Nov 2000, Rik van Riel wrote:
 On Thu, 16 Nov 2000, Szabolcs Szakacsits wrote:
   [snip exploit that really shouldn't take Linux down]

I don't really consider it an exploit. It's a kind of workload
optimized for fast testing, simulating many busy user daemons
(e.g. dynamically generating web pages). Everybody knows the Slashdot
effect. A system is designed for a workload according to a budget and
other factors, but the moment the load gets *much* higher than was
ever expected, the system starts to thrash and nobody can log in or
start new processes. You can pull the cable, but if it's a remote
box then you are really in a bad situation. Or if a local [e.g.
computing] batch job runs away, you also must hit the reset button.
This happens so often that it should really be taken seriously now;
those who don't believe it should just search for the typical OOM
crash patterns in user reports on various mailing lists/newsgroups.

> > This or something similar didn't kill the box [I've tried all local
> > DoS from Packetstorm that I could find]. Please send a working
> > example. Of course it's probably possible for user apps to trigger
> > root-owned processes to eat memory eagerly, but that's a problem in
> > the design of the process running as root and not a kernel issue.
>
> Not necessarily, but your patch will probably make a difference
> for quite a number of people...

Could you please explain what you mean? ;) I saw only ONE difference:
the system remains usable for root but not for anybody else. Everything
else is the same as before. Of course I think there are still problems
with the patch, but to be honest I don't know what they are ... except
those I wrote about before -- e.g. the latest, not yet released version
definitely doesn't work with your OOM killer [the system just thrashes].
Can you explain why you put the process killing in
do_try_to_free_pages() instead of the original place, do_page_fault()?
I guess putting it in do_page_fault() [if possible] would fix my
current problem.

> > If you think fork() kills the box, then ulimit the maximum number
> > of user processes (ulimit -u). That is a different issue and bad
> > design in the scheduler (see e.g. Tru64 for a better one).
>
> My fair scheduler catches this one just fine. It hasn't
> been integrated in the kernel yet, but both VA Linux and
> Conectiva use it in their kernel RPM.

I know about two fair schedulers for Linux; one of them is yours, but
I haven't been able to try them out yet. Anyway, definitely a must ;)

> While this is not one of the sexy new kernel
> features, this will help quite a few system
> administrators and is destined to a long and
> healthy life inside kernel RPMs, maybe even
> in the main kernel tree (when 2.5 splits?).

Thanks for the feedback,

Szaka



