Re: [RFC] Parallelize IO for e2fsck

2008-02-03 Thread KOSAKI Motohiro
Hi Pavel

> > > As user pages are always in highmem, this should be easy to decide:
> > > only send SIGDANGER when highmem is full. (Yes, there are
> > > inodes/dentries/file descriptors in lowmem, but I doubt apps will
> > > respond to SIGDANGER by closing files).
> >
> > Good point; for a system with at least (say) 2GB of memory, that
> > definitely makes sense.  For a system with less than 768 megs of
> > memory (how quaint, but it wasn't that long ago this was a lot of
> > memory :-), there wouldn't *be* any memory in highmem at all
>
> Ok, so it is 'send SIGDANGER when all zones are low', because user
> allocations can go from all zones (unless you have something really
> exotic, I'm not sure if that is true on huge NUMA  machines & similar).

Thank you, that is a good point.

To be honest, the zone awareness of the current mem_notify is premature.
I think we need to extend the RSS statistics to per-zone RSS,
but that is not implemented yet ;-)

And unfortunately I have no highmem machine,
so mem_notify has not been tested much on highmem.

If you can help test it, I will be very happy!
Thanks.


Re: Kernel Event Notifications (was: [RFC] Parallelize IO for e2fsck)

2008-02-03 Thread KOSAKI Motohiro
Hi Jon

> I looked at this a year or two back, then ran out of time. But the thing
> I wanted to do was have libc's memory allocation routines extended to
> handle these through reservations - the kernel should send a userspace
> notification and then there should be some kind of concept of returning
> memory that's been used for "opportunistic" userspace caching, e.g. in
> firefox to cache the last 10 web pages. Let us know how you get on :)

Sorry for the late response.
(I didn't notice your mail ;-)

You are right...
careless user-space caching is a very important problem.

But I don't think this is a libc problem.
glibc's malloc hardly caches any memory
(by default it caches only 128K).

However, some applications use a lot of memory for overly opportunistic
caching.  I understand we need to promote the use of mem_notify to
application developers once it is merged into mainline.

I have no idea how to solve this easily.


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread david

On Mon, 28 Jan 2008, Theodore Tso wrote:


On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:


As user pages are always in highmem, this should be easy to decide:
only send SIGDANGER when highmem is full. (Yes, there are
inodes/dentries/file descriptors in lowmem, but I doubt apps will
respond to SIGDANGER by closing files).


Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all


not to mention machines with 1G of ram (900M lowmem, 128M highmem)

David Lang


Re: Kernel Event Notifications (was: [RFC] Parallelize IO for e2fsck)

2008-01-28 Thread Jon Masters
On Sat, 2008-01-26 at 16:55 +0300, Al Boldi wrote:
> KOSAKI Motohiro wrote:
> > > > And from a performance point of view letting applications voluntarily
> > > > free some memory is better even than starting to swap.
> > >
> > > Absolutely.
> >
> > the mem_notify patch can realize "just before starting swapping"
> > notification :)

I looked at this a year or two back, then ran out of time. But the thing
I wanted to do was have libc's memory allocation routines extended to
handle these through reservations - the kernel should send a userspace
notification and then there should be some kind of concept of returning
memory that's been used for "opportunistic" userspace caching, e.g. in
firefox to cache the last 10 web pages. Let us know how you get on :)

Jon.





Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Pavel Machek
On Mon 2008-01-28 14:56:33, Theodore Tso wrote:
> On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> > 
> > As user pages are always in highmem, this should be easy to decide:
> > only send SIGDANGER when highmem is full. (Yes, there are
> > inodes/dentries/file descriptors in lowmem, but I doubt apps will
> > respond to SIGDANGER by closing files).
> 
> Good point; for a system with at least (say) 2GB of memory, that
> definitely makes sense.  For a system with less than 768 megs of
> memory (how quaint, but it wasn't that long ago this was a lot of
> memory :-), there wouldn't *be* any memory in highmem at all

Ok, so it is 'send SIGDANGER when all zones are low', because user
allocations can go from all zones (unless you have something really
exotic, I'm not sure if that is true on huge NUMA  machines & similar).

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Theodore Tso
On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> 
> As user pages are always in highmem, this should be easy to decide:
> only send SIGDANGER when highmem is full. (Yes, there are
> inodes/dentries/file descriptors in lowmem, but I doubt apps will
> respond to SIGDANGER by closing files).

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all

- Ted


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Pavel Machek
Hi!

> It's been discussed before, but I suspect the main reason why it was
> never done is no one submitted a patch.  Also, the problem is actually
> a pretty complex one.  There are a couple of different stages where
> you might want to send an alert to processes:
> 
> * Data is starting to get ejected from page/buffer cache
> * System is starting to swap
> * System is starting to really struggle to find memory
> * System is starting an out-of-memory killer
> 
> AIX's SIGDANGER really did the last two, where the OOM killer would
> tend to avoid processes that had a SIGDANGER handler in favor of
> processes that were SIGDANGER unaware.
> 
> Then there is the additional complexity in Linux that you have
> multiple zones of memory, which at least on the historically more
> popular x86 was highly, highly important.  You could say that whenever
> there is sufficient memory pressure in any zone that you start
> ejecting data from caches or start to swap that you start sending the
> signals --- but on x86 systems with lowmem, that could happen quite
> frequently, and since a user process has no idea whether its resources
> are in lowmem or highmem, there's not much you can do about this.

As user pages are always in highmem, this should be easy to decide:
only send SIGDANGER when highmem is full. (Yes, there are
inodes/dentries/file descriptors in lowmem, but I doubt apps will
respond to SIGDANGER by closing files).
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Kernel Event Notifications (was: [RFC] Parallelize IO for e2fsck)

2008-01-26 Thread KOSAKI Motohiro
Hi Al

> > the mem_notify patch can realize "just before starting swapping"
> > notification :)
> >
> > to be honest, I don't know fs guys requirement.
> > if lacking feature of fs guys needed, I implement it with presure if
> > you tell me it.
>
> These notifications are really useful, but it may be much wiser to pipe them
> thru some kevent-notification sub-system, instead of introducing kernel
> notifier-chain end-points left, right, and center.

Aaahh,
I understand your concern well,
but the current design was decided through discussion among many people.

If anybody needs kevent notification, I will add it to the current
implementation rather than replace it.

thanks.


Kernel Event Notifications (was: [RFC] Parallelize IO for e2fsck)

2008-01-26 Thread Al Boldi
KOSAKI Motohiro wrote:
> > > And from a performance point of view letting applications voluntarily
> > > free some memory is better even than starting to swap.
> >
> > Absolutely.
>
> the mem_notify patch can realize "just before starting swapping"
> notification :)
>
> to be honest, I don't know fs guys requirement.
> if lacking feature of fs guys needed, I implement it with presure if
> you tell me it.

These notifications are really useful, but it may be much wiser to pipe them 
thru some kevent-notification sub-system, instead of introducing kernel 
notifier-chain end-points left, right, and center.


Thanks!

--
Al



Re: [RFC] Parallelize IO for e2fsck

2008-01-26 Thread Theodore Tso
On Fri, Jan 25, 2008 at 05:55:51PM -0800, Bryan Henderson wrote:
> I was surprised to see AIX do late allocation by default, because IBM's 
> traditional style is bulletproof systems.  A system where a process can be 
> killed at unpredictable times because of resource demands of unrelated 
> processes doesn't really fit that style.
> 
> It's really a fairly unusual application that benefits from late 
> allocation: one that creates a lot more virtual memory than it ever 
> touches.  For example, a sparse array.  Or am I missing something?

I guess it depends on how far you try to do "bulletproof".  OSF/1 used
to use "bulletproof" as its default --- and I had to turn it off on
tsx-11.mit.edu (the first North American ftp server for Linux :-),
because the difference was something like 50 ftp daemons versus over
500 on the same server.  It reserved VM space for the text segment of
every single process, since at least in theory, it's possible for
every single text page to get modified using ptrace if (for example) a
debugger were to set a break point on every single page of every
single text segment of every single ftp daemon.

You can also see potential problems for Java programs.  Suppose you
had some gigantic Java Application (say, Lotus Notes, or Websphere
Application Server) which is taking up many, many, MANY gigabytes of
VM space.  Now suppose the Java application needs to fork and exec
some trivial helper program.  For that tiny instant, between the fork
and exec, the VM requirements in "bulletproof" mode would double,
since while 99.% of the time programs will immediately discard the
VM upon the exec, there is always the possibility that the child
process will touch every single data page, forcing a copy on write,
and never do the exec.

There are of course different levels of "bulletproof" between the
extremes of "totally bulletproof" and "late binding" from an
algorithmic standpoint.  For example, you could ignore the needed
pages caused by ptrace(); more challenging would be how to handle
the fork/exec semantics, although there could be kludges such as
strongly encouraging applications to use an old-fashioned BSD-style
vfork() to guarantee that the child couldn't double VM requirements
between the vfork() and exec().  I certainly can't say for sure what
the AIX designers had in mind, and why they didn't choose one of the
more intermediate design choices.  

However, it is fair to say that "100% bulletproof" can require
reserving far more VM resources than you might first expect.  Even a
company which is highly incented to sell large amounts of hardware,
such as Digital, might not have wanted their OS to be only able to
support an embarrassingly small number of simultaneous ftpd
connections.  I know this for sure because the OSF/1 documentation,
when discussing their VM tuning knobs, specifically talked about the
scenario that I ran into with tsx-11.mit.edu.

Regards,

- Ted
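
For illustration, a minimal sketch of the vfork()/exec() pattern mentioned
above: the child borrows the parent's address space until it calls an
exec function or _exit(), so the parent's potentially huge VM is never
duplicated or double-reserved between the two calls.  The helper path is
just a placeholder.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_helper(void)
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;                       /* vfork failed */
    if (pid == 0) {
        /* in the child only exec*() or _exit() are safe */
        execl("/usr/bin/helper", "helper", (char *)NULL);
        _exit(127);                      /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);            /* parent resumes after exec/_exit */
    return status;
}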


Re: [RFC] Parallelize IO for e2fsck

2008-01-26 Thread KOSAKI Motohiro
> > And from a performance point of view letting applications voluntarily
> > free some memory is better even than starting to swap.
>
> Absolutely.

The mem_notify patch can provide "just before starting to swap" notification :)

To be honest, I don't know the filesystem folks' requirements.
If a feature the filesystem folks need is missing, I will gladly implement
it if you tell me about it.


Re: [RFC] Parallelize IO for e2fsck

2008-01-26 Thread KOSAKI Motohiro
> The commentary on the mem_notify threads claimed that the signal is
> easily provided by setting up the file handle for SIGIO.

BTW:
Of course, you can receive any signal instead of SIGIO by using fcntl(F_SETSIG)  :-)
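
As a rough illustration of that, assuming the /dev/mem_notify device from
this patch series is present (the fcntl() and sigaction() calls themselves
are standard Linux APIs), a process could ask for a real-time signal
instead of SIGIO like this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void low_mem_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info; (void)ctx;    /* async-signal-safe work only */
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = low_mem_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGRTMIN + 1, &sa, NULL);

    int fd = open("/dev/mem_notify", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    fcntl(fd, F_SETOWN, getpid());
    fcntl(fd, F_SETSIG, SIGRTMIN + 1);   /* deliver this signal instead of SIGIO */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | FASYNC);

    pause();                             /* wait for the low-memory notification */
    close(fd);
    return 0;
}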


Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Bryan Henderson
>> Incidentally, some context for the AIX approach to the OOM problem: a
>> process may exclude itself from OOM vulnerability altogether.  It places
>> itself in "early allocation" mode, which means at the time it creates
>> virtual memory, it reserves enough backing store for the worst case.  The
>> memory manager does not send such a process the SIGDANGER signal or
>> terminate it when it runs out of paging space.  Before c. 2000, this was
>> the only mode.  Now the default is late allocation mode, which is similar
>> to Linux.
>
>This is an interesting approach. It feels like some programs might be
>interested in choosing this mode instead of risking OOM.

It's the way virtual memory always worked when it was first invented.  The 
system not only reserved space to back every page of virtual memory; it 
assigned the particular blocks for it.  Late allocation was a later 
innovation, and I believe its main goal was to make it possible to use the 
cheaper disk drives for paging instead of drums.  Late allocation gives 
you better locality on disk, so the seeking doesn't eat you alive (drums 
don't seek).  Even then, I assume (but am not sure) that the system at 
least reserved the space in an account somewhere so at pageout time there 
was guaranteed to be a place to which to page out.  Overcommitting page 
space to save on disk space was a later idea.

I was surprised to see AIX do late allocation by default, because IBM's 
traditional style is bulletproof systems.  A system where a process can be 
killed at unpredictable times because of resource demands of unrelated 
processes doesn't really fit that style.

It's really a fairly unusual application that benefits from late 
allocation: one that creates a lot more virtual memory than it ever 
touches.  For example, a sparse array.  Or am I missing something?

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Zan Lynx

On Fri, 2008-01-25 at 04:09 -0700, Andreas Dilger wrote:
> On Jan 24, 2008  17:25 -0700, Zan Lynx wrote:
> > Have y'all been following the /dev/mem_notify patches?
> > http://article.gmane.org/gmane.linux.kernel/628653
> 
> Having the notification be via poll() is a very restrictive processing
> model.  Having the notification be via a signal means that any kind of
> process (and not just those that are event loop driven) can register
> a callback at some arbitrary point in the code and be notified.  I
> don't object to the poll() interface, but it would be good to have a
> signal mechanism also.

The commentary on the mem_notify threads claimed that the signal is
easily provided by setting up the file handle for SIGIO.

Yeah.  Here it is...copied from email written by KOSAKI Motohiro:

implement FASYNC capability to /dev/mem_notify.


fd = open("/dev/mem_notify", O_RDONLY);

fcntl(fd, F_SETOWN, getpid());

flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags|FASYNC);  /* when low memory, receive SIGIO */
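
One way to round that snippet out, as a sketch only, is to install a SIGIO
handler before enabling FASYNC so the notification has somewhere to land;
the actual cache-trimming work would then happen in the main loop:

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t low_memory;

static void sigio_handler(int sig)
{
    (void)sig;
    low_memory = 1;                      /* main loop checks this and trims caches */
}

static int watch_mem_notify(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = sigio_handler;
    sigaction(SIGIO, &sa, NULL);

    int fd = open("/dev/mem_notify", O_RDONLY);
    if (fd < 0)
        return -1;
    fcntl(fd, F_SETOWN, getpid());
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | FASYNC);
    return fd;
}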

-- 
Zan Lynx <[EMAIL PROTECTED]>




Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Andreas Dilger
On Jan 24, 2008  17:25 -0700, Zan Lynx wrote:
> Have y'all been following the /dev/mem_notify patches?
> http://article.gmane.org/gmane.linux.kernel/628653

Having the notification be via poll() is a very restrictive processing
model.  Having the notification be via a signal means that any kind of
process (and not just those that are event loop driven) can register
a callback at some arbitrary point in the code and be notified.  I
don't object to the poll() interface, but it would be good to have a
signal mechanism also.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
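
For comparison, the poll()-driven model being discussed would look roughly
like the following sketch; the device name and the convention that POLLIN
means "memory is getting low" come from the mem_notify patches, the rest
is ordinary poll() usage:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mem_notify", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            /* memory pressure reported: drop opportunistic caches here */
            fprintf(stderr, "low memory notification received\n");
        }
    }
}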



Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Zan Lynx

On Thu, 2008-01-24 at 18:40 -0500, Theodore Tso wrote:
> On Fri, Jan 25, 2008 at 01:08:09AM +0200, Adrian Bunk wrote:
> > In practice, there is a small number of programs that are both the
> > common memory hogs and should be able to reduce their memory consumption
> > by 10% or 20% without big problems when requested (e.g. Java VMs,
> > Firefox and databases come into my mind).
> 
> I agree, it's only a few processes where this makes sense.  But for
> those that do, it would be useful if they could register with the
> kernel that would like to know, (just before the system starts
> ejecting cached data, just before swapping, etc.) and at what
> frequency.  And presumably, if the kernel notices that a process is
> responding to such requests with memory actually getting released back
> to the system, that process could get "rewarded" by having the OOM
> killer less likely to target that particular thread.

Have y'all been following the /dev/mem_notify patches?
http://article.gmane.org/gmane.linux.kernel/628653

-- 
Zan Lynx <[EMAIL PROTECTED]>




Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Theodore Tso
On Fri, Jan 25, 2008 at 01:08:09AM +0200, Adrian Bunk wrote:
> In practice, there is a small number of programs that are both the
> common memory hogs and should be able to reduce their memory consumption
> by 10% or 20% without big problems when requested (e.g. Java VMs,
> Firefox and databases come into my mind).

I agree, it's only a few processes where this makes sense.  But for
those that do, it would be useful if they could register with the
kernel that they would like to know (just before the system starts
ejecting cached data, just before swapping, etc.) and at what
frequency.  And presumably, if the kernel notices that a process is
responding to such requests with memory actually getting released back
to the system, that process could get "rewarded" by having the OOM
killer less likely to target that particular thread.

AIX basically did this with SIGDANGER (the signal is ignored by
default), except there wasn't the ability for the process to tell the
kernel at what level of memory pressure it should start getting
notified, and there was no way for the kernel to tell how bad the
memory pressure actually was.  On the other hand, it was a relatively
simple design.

In practice very few processes would indeed pay attention to
SIGDANGER, so I think you're quite right there.

> And from a performance point of view letting applications voluntarily 
> free some memory is better even than starting to swap.

Absolutely.

- Ted


Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Adrian Bunk
On Thu, Jan 24, 2008 at 06:32:15PM +0100, Bodo Eggert wrote:
> Alan Cox <[EMAIL PROTECTED]> wrote:
> 
> >> I'd tried to advocate SIGDANGER some years ago as well, but none of
> >> the kernel maintainers were interested.  It definitely makes sense
> >> to have some sort of mechanism like this.  At the time I first brought
> >> it up it was in conjunction with Netscape using too much cache on some
> >> system, but it would be just as useful for all kinds of other memory-
> >> hungry applications.
> > 
> > There is an early thread for a /proc file which you can add to your
> > poll() set and it will wake people when memory is low. Very elegant and
> > if async support is added it will also give you the signal variant for
> > free.
> 
> IMO you'll need a userspace daemon. The kernel does only know about the
> amount of memory available / recommended for a system (or container),
> while the user knows which program's cache is most precious today.
> 
> (Off cause the userspace daemon will in turn need the /proc file.)
> 
> I think a single, system-wide signal is the second-to worst solution: All
> applications (or the wrong one, if you select one) would free their caches
> and start to crawl, and either stay in this state or slowly increase their
> caches again until they get signaled again. And the signal would either
> come too early or too late. The userspace daemon could collect the weighted
> demand of memory from all applications and tell them how much to use.

I don't think that's something that would require fine-tuning on a
per-application basis - the kernel should tell all applications once to
reduce memory consumption and write a fat warning to the logs (which
will on well-maintained systems be mailed to the admin).

Your "and tell them how much to use" wouldn't work for most applications 
- e.g. I've worked the last weeks with a computer with 512 MB RAM and no 
Swap, which means usually only 200 MB of free RAM. I've gotten quite 
used to git aborting with "fatal: Out of memory, malloc failed" when 
200 MB weren't enough for git, and I don't think there is any reasonable 
way for git to reduce the memory usage while continuing to run.

In practice, there is a small number of programs that are both the
common memory hogs and should be able to reduce their memory consumption
by 10% or 20% without big problems when requested (e.g. Java VMs,
Firefox and databases come into my mind).

And from a performance point of view letting applications voluntarily 
free some memory is better even than starting to swap.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Andreas Dilger
On Jan 24, 2008  18:32 +0100, Bodo Eggert wrote:
> I think a single, system-wide signal is the second-to worst solution: All
> applications (or the wrong one, if you select one) would free their caches
> and start to crawl, and either stay in this state or slowly increase their
> caches again until they get signaled again. And the signal would either
> come too early or too late. The userspace daemon could collect the weighted
> demand of memory from all applications and tell them how much to use.

Well, sending a few signals (maybe to the top 5 processes in the OOM killer
list) is still a LOT better than OOM-killing them without warning...  That
way important system processes could be taught to understand SIGDANGER and
maybe do something about it instead of being killed, and if Firefox and
other memory hungry processes flush some of their cache it is not fatal.

I wouldn't think that SIGDANGER means "free all of your cache", since the
memory usage clearly wasn't a problem a few seconds previously, so as
an application writer I'd code it as "flush the oldest 10% of my cache"
or similar, and the kernel could send SIGDANGER again (or kill the real
offender) if the memory usage again becomes an issue.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
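
A hypothetical sketch of the "flush the oldest 10% of my cache" policy
described above, to make it concrete: SIGDANGER does not exist on Linux,
so an application would hook the same logic to whatever signal the chosen
notification mechanism delivers (for example SIGIO from /dev/mem_notify),
and the cache structure here is invented for the example.

#include <signal.h>
#include <stddef.h>

struct cache_entry { struct cache_entry *next; /* ...cached payload... */ };

static struct cache_entry *lru_oldest;   /* singly linked, oldest entry first */
static size_t cache_count;
static volatile sig_atomic_t trim_requested;

static void danger_handler(int sig) { (void)sig; trim_requested = 1; }

void install_trim_handler(int signo)     /* e.g. SIGIO */
{
    struct sigaction sa = { 0 };
    sa.sa_handler = danger_handler;
    sigaction(signo, &sa, NULL);
}

void maybe_trim_cache(void)              /* called from the application's main loop */
{
    if (!trim_requested)
        return;
    trim_requested = 0;
    for (size_t n = cache_count / 10; n > 0 && lru_oldest; n--) {
        struct cache_entry *victim = lru_oldest;
        lru_oldest = victim->next;
        cache_count--;
        /* release victim's payload and free it here */
    }
}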



Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Bodo Eggert
Alan Cox <[EMAIL PROTECTED]> wrote:

>> I'd tried to advocate SIGDANGER some years ago as well, but none of
>> the kernel maintainers were interested.  It definitely makes sense
>> to have some sort of mechanism like this.  At the time I first brought
>> it up it was in conjunction with Netscape using too much cache on some
>> system, but it would be just as useful for all kinds of other memory-
>> hungry applications.
> 
> There is an early thread for a /proc file which you can add to your
> poll() set and it will wake people when memory is low. Very elegant and
> if async support is added it will also give you the signal variant for
> free.

IMO you'll need a userspace daemon. The kernel does only know about the
amount of memory available / recommended for a system (or container),
while the user knows which program's cache is most precious today.

(Of course the userspace daemon will in turn need the /proc file.)

I think a single, system-wide signal is the second-to-worst solution: All
applications (or the wrong one, if you select one) would free their caches
and start to crawl, and either stay in this state or slowly increase their
caches again until they get signaled again. And the signal would either
come too early or too late. The userspace daemon could collect the weighted
demand of memory from all applications and tell them how much to use.



Re: [RFC] Parallelize IO for e2fsck

2008-01-22 Thread Bryan Henderson
>I think there is a clear need for applications to be able to
>register a callback from the kernel to indicate that the machine as
>a whole is running out of memory and that the application should
>trim it's caches to reduce memory utilisation.
>
>Perhaps instead of swapping immediately, a SIGLOWMEM could be sent ...

The problem with that approach is that the Fsck process doesn't know how 
its need for memory compares with other process' need for memory.  How 
much memory should it give up?  Maybe it should just quit altogether if 
other processes are in danger of deadlocking.  Or maybe it's best for it 
to keep all its memory and let some other frivolous process give up its 
memory instead.

It's the OS's job to have a view of the entire system and make resource 
allocation decisions.

If it's just a matter of the application choosing a better page frame to 
vacate than what the kernel would have taken, (which is more a matter of 
self-interest than resource allocation), then Fsck can do that more 
directly by just monitoring its own page fault rate.  If it's high, then 
it's using more real memory than the kernel thinks it's entitled to and it 
can reduce its memory footprint to improve its speed.  It can even check 
whether an access to readahead data caused a page fault; if so, it knows 
reading ahead is actually making things worse and therefore reduce 
readahead until the page faults stop happening.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
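
A rough sketch of the self-monitoring Bryan describes (watching one's own
page-fault rate and throttling readahead), under the assumption that the
process adjusts an internal readahead window; getrusage() and ru_majflt
are standard, while the window variable and thresholds are invented for
the example:

#include <sys/resource.h>

static long prev_majflt;
static unsigned int readahead_window = 64;   /* blocks to read ahead */

void adjust_readahead(void)                  /* called periodically */
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return;
    long new_faults = ru.ru_majflt - prev_majflt;
    prev_majflt = ru.ru_majflt;

    if (new_faults > 0 && readahead_window > 1)
        readahead_window /= 2;               /* we are being paged: back off */
    else if (new_faults == 0 && readahead_window < 1024)
        readahead_window *= 2;               /* no pressure: ramp back up */
}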



Re: [RFC] Parallelize IO for e2fsck

2008-01-22 Thread Arnaldo Carvalho de Melo
On Tue, Jan 22, 2008 at 09:40:52AM -0500, Theodore Tso wrote:
> On Tue, Jan 22, 2008 at 12:00:50AM -0700, Andreas Dilger wrote:
> > > AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
> > > the system was about to hit OOM, not when it was about to start swapping.
> > 
> > I'd tried to advocate SIGDANGER some years ago as well, but none of
> > the kernel maintainers were interested.  It definitely makes sense
> > to have some sort of mechanism like this.  At the time I first brought
> > it up it was in conjunction with Netscape using too much cache on some
> > system, but it would be just as useful for all kinds of other memory-
> > hungry applications.
> 
> It's been discussed before, but I suspect the main reason why it was
> never done is no one submitted a patch.  Also, the problem is actually
> a pretty complex one.  There are a couple of different stages where
> you might want to send an alert to processes:

Isn't Marcelo, Riel and some other people working on memory
notifications?

- Arnaldo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-22 Thread Theodore Tso
On Tue, Jan 22, 2008 at 12:00:50AM -0700, Andreas Dilger wrote:
> > AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
> > the system was about to hit OOM, not when it was about to start swapping.
> 
> I'd tried to advocate SIGDANGER some years ago as well, but none of
> the kernel maintainers were interested.  It definitely makes sense
> to have some sort of mechanism like this.  At the time I first brought
> it up it was in conjunction with Netscape using too much cache on some
> system, but it would be just as useful for all kinds of other memory-
> hungry applications.

It's been discussed before, but I suspect the main reason why it was
never done is no one submitted a patch.  Also, the problem is actually
a pretty complex one.  There are a couple of different stages where
you might want to send an alert to processes:

* Data is starting to get ejected from page/buffer cache
* System is starting to swap
* System is starting to really struggle to find memory
* System is starting an out-of-memory killer

AIX's SIGDANGER really did the last two, where the OOM killer would
tend to avoid processes that had a SIGDANGER handler in favor of
processes that were SIGDANGER unaware.
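
For reference, catching it on AIX was just ordinary signal handling,
roughly the following (SIGDANGER isn't defined on Linux, hence the
#ifdef; what the handler actually frees is up to the application):

#include <signal.h>
#include <string.h>

static void on_danger(int sig)
{
    (void)sig;
    /*
     * Give back whatever opportunistic caches we hold; on AIX the
     * mere presence of a handler also made the OOM logic less likely
     * to pick on us.
     */
}

static void install_danger_handler(void)
{
#ifdef SIGDANGER
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_danger;
    sigaction(SIGDANGER, &sa, NULL);
#endif
}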

Then there is the additional complexity in Linux that you have
multiple zones of memory, which at least on the historically more
popular x86 was highly, highly important.  You could say that whenever
there is sufficient memory pressure in any zone that you start
ejecting data from caches or start to swap that you start sending the
signals --- but on x86 systems with lowmem, that could happen quite
frequently, and since a user process has no idea whether its resources
are in lowmem or highmem, there's not much you can do about this.

Hopefully this is less of an issue today, since the 2.6 VM is much
better behaved, and people are gradually moving over to x86_64
anyway.  (Sorry SGI and Intel, unfortunately they're not moving over
to the Itanic :-).   So maybe this would be better received now.

Bringing us back to the main topic at hand, one of the tradeoffs in
Val's current approach is that by relying on the kernel's buffer
cache, we don't have to worry about locking and coherency at the
userspace level.  OTOH, we give up low-level control over when memory
gets thrown out, and it also means that simply getting notified when
the system starts to swap isn't good enough.  We need to know much
earlier, when the system starts ejecting data from the buffer and page
caches.

Does this matter?  Well, there are a couple of use cases:

 * The restricted boot environment
 * The background "once a month" take a snapshot and check
 * The oh-my-gosh we-lost-a-filesystem -- repair it while the 
   IMAP server is still on-line serving data from the other 
   mounted filesystems.

It's the last case where things get really tricky

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-22 Thread Alan Cox
> I'd tried to advocate SIGDANGER some years ago as well, but none of
> the kernel maintainers were interested.  It definitely makes sense
> to have some sort of mechanism like this.  At the time I first brought
> it up it was in conjunction with Netscape using too much cache on some
> system, but it would be just as useful for all kinds of other memory-
> hungry applications.

There is an early thread for a /proc file which you can add to your
poll() set and it will wake people when memory is low. Very elegant and
if async support is added it will also give you the signal variant for
free.
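
The signal variant would then be the usual fcntl() dance on the same
descriptor, assuming async support gets wired up the way it is for
other pollable files; a sketch (the handler body is illustrative):

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t low_memory;

static void on_sigio(int sig)
{
    (void)sig;
    low_memory = 1;             /* picked up by the main loop */
}

static int setup_async_lowmem(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    signal(SIGIO, on_sigio);
    /* Deliver SIGIO to us when the file becomes readable. */
    fcntl(fd, F_SETOWN, getpid());
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
    return fd;
}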

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-22 Thread David Chinner
On Tue, Jan 22, 2008 at 12:05:11AM -0700, Andreas Dilger wrote:
> On Jan 22, 2008  14:38 +1100, David Chinner wrote:
> > On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
> > > I discussed this with Ted at one point also.  This is a generic problem,
> > > not just for readahead, because "fsck" can run multiple e2fsck in parallel
> > > and in case of many large filesystems on a single node this can cause
> > > memory usage problems also.
> > > 
> > > What I was proposing is that "fsck.{fstype}" be modified to return an
> > > estimated minimum amount of memory needed, and some "desired" amount of
> > > memory (i.e. readahead) to fsck the filesystem, using some parameter like
> > > "fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
> > > return the output in the expected format, or returns an error then fsck
> > > will assume some amount of memory based on the device size and continue
> > > as it does today.
> > 
> > And while fsck is running, some other program runs that uses
> > memory and blows your carefully calculated parameters to smithereens?
> 
> Well, fsck has a rather restricted working environment, because it is
> run before most other processes start (i.e. single-user mode).  For fsck
> initiated by an admin in other runlevels the admin would need to specify
> the upper limit of memory usage.  My proposal was only for the single-user
> fsck at boot time.

The simple case. ;)

Because XFS has shutdown features, it's not uncommon to hear about
people running xfs_repair on an otherwise live system. e.g. XFS
detects a corrupted block, shuts down the filesystem, the admin
unmounts it, runs xfs_repair, puts it back online. Meanwhile, all
the other filesystems and users continue unaffected. In this use
case, getting feedback about memory usage is, IMO, very worthwhile.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 22, 2008  14:38 +1100, David Chinner wrote:
> On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
> > I discussed this with Ted at one point also.  This is a generic problem,
> > not just for readahead, because "fsck" can run multiple e2fsck in parallel
> > and in case of many large filesystems on a single node this can cause
> > memory usage problems also.
> > 
> > What I was proposing is that "fsck.{fstype}" be modified to return an
> > estimated minimum amount of memory needed, and some "desired" amount of
> > memory (i.e. readahead) to fsck the filesystem, using some parameter like
> > "fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
> > return the output in the expected format, or returns an error then fsck
> > will assume some amount of memory based on the device size and continue
> > as it does today.
> 
> And while fsck is running, some other program runs that uses
> memory and blows your carefully calculated parameters to smithereens?

Well, fsck has a rather restricted working environment, because it is
run before most other processes start (i.e. single-user mode).  For fsck
initiated by an admin in other runlevels the admin would need to specify
the upper limit of memory usage.  My proposal was only for the single-user
fsck at boot time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 21, 2008  23:17 -0500, [EMAIL PROTECTED] wrote:
> On Tue, 22 Jan 2008 14:38:30 +1100, David Chinner said:
> > Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
> > to processes that aren't masking the signal, followed by a short
> > grace period to allow the processes to free up some memory before
> > swapping out pages from that process?
> 
> AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
> the system was about to hit OOM, not when it was about to start swapping.

I'd tried to advocate SIGDANGER some years ago as well, but none of
the kernel maintainers were interested.  It definitely makes sense
to have some sort of mechanism like this.  At the time I first brought
it up it was in conjunction with Netscape using too much cache on some
system, but it would be just as useful for all kinds of other memory-
hungry applications.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Valdis . Kletnieks
On Tue, 22 Jan 2008 14:38:30 +1100, David Chinner said:

> Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
> to processes that aren't masking the signal, followed by a short
> grace period to allow the processes to free up some memory before
> swapping out pages from that process?

AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
the system was about to hit OOM, not when it was about to start swapping.

I suspect both approaches have their merits...


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread David Chinner
On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
> On Jan 16, 2008  13:30 -0800, Valerie Henson wrote:
> > I have a partial solution that sort of blindly manages the buffer
> > cache.  First, the user passes e2fsck a parameter saying how much
> > memory is available as buffer cache.  The readahead thread reads
> > things in and immediately throws them away so they are only in buffer
> > cache (no double-caching).  Then readahead and e2fsck work together so
> > that readahead only reads in new blocks when the main thread is done
> > with earlier blocks.  The already-used blocks get kicked out of buffer
> > cache to make room for the new ones.
> >
> > What would be nice is to take into account the current total memory
> > usage of the whole fsck process and factor that in.  I don't think it
> > would be hard to add to the existing cache management framework.
> > Thoughts?
> 
> I discussed this with Ted at one point also.  This is a generic problem,
> not just for readahead, because "fsck" can run multiple e2fsck in parallel
> and in case of many large filesystems on a single node this can cause
> memory usage problems also.
> 
> What I was proposing is that "fsck.{fstype}" be modified to return an
> estimated minimum amount of memory needed, and some "desired" amount of
> memory (i.e. readahead) to fsck the filesystem, using some parameter like
> "fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
> return the output in the expected format, or returns an error then fsck
> will assume some amount of memory based on the device size and continue
> as it does today.

And while fsck is running, some other program runs that uses
memory and blows your carefully calculated parameters to smithereens?

I think there is a clear need for applications to be able to
register a callback from the kernel to indicate that the machine as
a whole is running out of memory and that the application should
trim its caches to reduce memory utilisation.

Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
to processes that aren't masking the signal, followed by a short
grace period to allow the processes to free up some memory before
swapping out pages from that process?

With this sort of feedback, the fsck process can scale back its
readahead and remove cached info that is not critical to what it
is currently doing and thereby prevent readahead thrashing as
memory usage of the fsck process itself grows.
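
The fsck side of such a scheme would be trivial; a sketch (SIGLOWMEM is
hypothetical, so SIGUSR1 stands in for it here, and the batch sizes are
arbitrary) where the readahead loop just checks a flag set by the
handler:

#include <signal.h>

#ifndef SIGLOWMEM
#define SIGLOWMEM SIGUSR1       /* stand-in: no such signal exists today */
#endif

static volatile sig_atomic_t low_mem_seen;

static void on_lowmem(int sig)
{
    (void)sig;
    low_mem_seen = 1;
}

static void install_lowmem_handler(void)
{
    signal(SIGLOWMEM, on_lowmem);
}

/* Called by the readahead thread before queueing the next batch. */
static unsigned int next_batch_size(unsigned int batch)
{
    if (low_mem_seen) {
        low_mem_seen = 0;
        /* Also a good point to drop cached data we've finished with. */
        return batch > 1 ? batch / 2 : 1;       /* scale back readahead */
    }
    return batch;
}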

Another example where this could be useful is to tell browsers to
release some of their cache rather than having the VM swap it out.

IMO, a scheme like this will be far more reliable than trying to
guess what the optimal settings are going to be over the whole
lifetime of a process

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 16, 2008  13:30 -0800, Valerie Henson wrote:
> I have a partial solution that sort of blindly manages the buffer
> cache.  First, the user passes e2fsck a parameter saying how much
> memory is available as buffer cache.  The readahead thread reads
> things in and immediately throws them away so they are only in buffer
> cache (no double-caching).  Then readahead and e2fsck work together so
> that readahead only reads in new blocks when the main thread is done
> with earlier blocks.  The already-used blocks get kicked out of buffer
> cache to make room for the new ones.
>
> What would be nice is to take into account the current total memory
> usage of the whole fsck process and factor that in.  I don't think it
> would be hard to add to the existing cache management framework.
> Thoughts?

I discussed this with Ted at one point also.  This is a generic problem,
not just for readahead, because "fsck" can run multiple e2fsck in parallel
and in case of many large filesystems on a single node this can cause
memory usage problems also.

What I was proposing is that "fsck.{fstype}" be modified to return an
estimated minimum amount of memory needed, and some "desired" amount of
memory (i.e. readahead) to fsck the filesystem, using some parameter like
"fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
return the output in the expected format, or returns an error then fsck
will assume some amount of memory based on the device size and continue
as it does today.

If the fsck.{fstype} does understand this parameter, then fsck makes a
decision based on devices, parallelism, total RAM (less some amount to
avoid thrashing), then it can call the individual fsck commands with
"--maximum-memory MMM /dev/XXX" so each knows how much cache it can
allocate.  This parameter can also be specified by the user if running
e2fsck directly.
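
To make the plumbing concrete, the top-level fsck side could be as
simple as the sketch below.  The option name is the one proposed above;
the two-number "min desired" output format is only a suggestion, not
something any fsck implements today:

#include <stdio.h>

/*
 * Ask fsck.<fstype> how much memory it wants for a device.  Returns -1
 * if the helper doesn't understand the option, in which case the
 * caller falls back to guessing from the device size as fsck does
 * today.
 */
static int report_memory_needed(const char *fsck_prog, const char *device,
                                unsigned long long *min_bytes,
                                unsigned long long *desired_bytes)
{
    char cmd[512];
    FILE *p;
    int ok;

    snprintf(cmd, sizeof(cmd), "%s --report-memory-needed %s",
             fsck_prog, device);
    p = popen(cmd, "r");
    if (!p)
        return -1;
    ok = (fscanf(p, "%llu %llu", min_bytes, desired_bytes) == 2);
    return (pclose(p) == 0 && ok) ? 0 : -1;
}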

I haven't looked through your patch yet, but I hope to get to it soon.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-17 Thread Valerie Henson
On Jan 17, 2008 5:15 PM, David Chinner <[EMAIL PROTECTED]> wrote:
> On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> > Hi y'all,
> >
> > This is a request for comments on the rewrite of the e2fsck IO
> > parallelization patches I sent out a few months ago.  The mechanism is
> > totally different.  Previously IO was parallelized by issuing IOs from
> > multiple threads; now a single thread issues fadvise(WILLNEED) and
> > then uses read() to complete the IO.
>
> Interesting.
>
> We ultimately rejected a similar patch to xfs_repair (pre-populating
> the kernel block device cache) mainly because of low memory
> performance issues and it doesn't really enable you to do anything
> particularly smart with optimising I/O patterns for larger, high
> performance RAID arrays.
>
> The low memory problems were particularly bad; the readahead
> thrashing caused a slowdown of 2-3x compared to the baseline and
> often it was due to the repair process requiring all of memory
> to cache stuff it would need later. IIRC, multi-terabyte ext3
> filesystems have similar memory usage problems to XFS, so there's
> a good chance that this patch will see the same sorts of issues.

That was one of my first concerns - how to avoid overflowing memory?
Whenever I screw it up on e2fsck, it does go, oh, 2 times slower due
to the minor detail of every single block being read from disk twice.
:)

I have a partial solution that sort of blindly manages the buffer
cache.  First, the user passes e2fsck a parameter saying how much
memory is available as buffer cache.  The readahead thread reads
things in and immediately throws them away so they are only in buffer
cache (no double-caching).  Then readahead and e2fsck work together so
that readahead only reads in new blocks when the main thread is done
with earlier blocks.  The already-used blocks get kicked out of buffer
cache to make room for the new ones.
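
To show the shape of it, the readahead/eviction pair boils down to two
posix_fadvise() calls; this is only a sketch (the block size is a
placeholder, and the real patch wraps this in its cache management
code):

#define _XOPEN_SOURCE 600
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>

#define FS_BLOCK_SIZE 4096ULL   /* placeholder; e2fsck knows the real size */

/* Ask the kernel to start pulling a block range into the buffer cache. */
static void readahead_blocks(int fd, unsigned long long blk,
                             unsigned long long count)
{
    posix_fadvise(fd, blk * FS_BLOCK_SIZE, count * FS_BLOCK_SIZE,
                  POSIX_FADV_WILLNEED);
}

/* Once the main thread is done with a range, kick it out of the buffer
 * cache so readahead can reuse the memory. */
static void release_blocks(int fd, unsigned long long blk,
                           unsigned long long count)
{
    posix_fadvise(fd, blk * FS_BLOCK_SIZE, count * FS_BLOCK_SIZE,
                  POSIX_FADV_DONTNEED);
}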

What would be nice is to take into account the current total memory
usage of the whole fsck process and factor that in.  I don't think it
would be hard to add to the existing cache management framework.
Thoughts?

> Promising results, though

Thanks!  It's solving a rather simpler problem than XFS check/repair. :)

-VAL
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-17 Thread David Chinner
On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> Hi y'all,
> 
> This is a request for comments on the rewrite of the e2fsck IO
> parallelization patches I sent out a few months ago.  The mechanism is
> totally different.  Previously IO was parallelized by issuing IOs from
> multiple threads; now a single thread issues fadvise(WILLNEED) and
> then uses read() to complete the IO.

Interesting.

We ultimately rejected a similar patch to xfs_repair (pre-populating
the kernel block device cache) mainly because of low memory
performance issues and it doesn't really enable you to do anything
particularly smart with optimising I/O patterns for larger, high
performance RAID arrays.

The low memory problems were particularly bad; the readahead
thrashing caused a slowdown of 2-3x compared to the baseline and
often it was due to the repair process requiring all of memory
to cache stuff it would need later. IIRC, multi-terabyte ext3
filesystems have similar memory usage problems to XFS, so there's
a good chance that this patch will see the same sorts of issues.

> Single disk performance doesn't change, but elapsed time drops by
> about 50% on a big RAID-5 box.  Passes 1 and 2 are parallelized.  Pass
> 5 is left as an exercise for the reader.

Promising results, though

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

