Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-07 Thread Hans Reiser

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Fri, 07 Jan 2000 00:32:48 +0300, Hans Reiser <[EMAIL PROTECTED]> said:
>
> > Andrea Arcangeli wrote:
> >> BTW, I thought Hans was talking about places that can't sleep (because of
> >> some not schedule-aware lock) when he said "place that cannot call
> >> balance_dirty()".
>
> > You were correct.  I think Stephen and I are miscommunicating here.
>
> Fine, I was just looking at it from the VFS point of view, not the
> specific filesystem.  In the worst case, a filesystem can always simply
> defer marking the buffer as dirty until after the locking window has
> passed, so there's obviously no fundamental problem with having a
> blocking mark_buffer_dirty.  If we want a non-blocking version too, with
> the requirement that the filesystem then do a manual rebalance once it
> is safe to do so, that will work fine too.
>
> --Stephen

Yes, but then you have to track what you defer.  Code complication.

I just want to leave things as they are until we have time to do SMP right.

When we do SMP right, then a mark_buffer_dirty() which causes a schedule is not a
problem.  Let's deal with this in 2.5.

Hans

--
Get Linux (http://www.kernel.org) plus ReiserFS
 (http://devlinux.org/namesys).  If you sell an OS or
internet appliance, buy a port of ReiserFS!  If you
need customizations and industrial grade support, we sell them.





Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-07 Thread Andrea Arcangeli

On Fri, 7 Jan 2000, Stephen C. Tweedie wrote:

>Fine, I was just looking at it from the VFS point of view, not the
>specific filesystem.  In the worst case, a filesystem can always simply
>defer marking the buffer as dirty until after the locking window has
>passed, so there's obviously no fundamental problem with having a
>blocking mark_buffer_dirty.  If we want a non-blocking version too, with
>the requirement that the filesystem then do a manual rebalance once it
>is safe to do so, that will work fine too.

I did the new mark_buffer_dirty blocking and __mark_buffer_dirty
nonblocking while fixing the 2.3.x buffer code.


ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.36pre5/buffer-2.gz

I have been running with the above applied for some days on a 2.3.36-based
kernel on Alpha, and all has worked fine so far under all kinds of load.

Andrea



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-07 Thread Stephen C. Tweedie

Hi,

On Fri, 07 Jan 2000 00:32:48 +0300, Hans Reiser <[EMAIL PROTECTED]> said:

> Andrea Arcangeli wrote:
>> BTW, I thought Hans was talking about places that can't sleep (because of
>> some not schedule-aware lock) when he said "place that cannot call
>> balance_dirty()".

> You were correct.  I think Stephen and I are miscommunicating here.

Fine, I was just looking at it from the VFS point of view, not the
specific filesystem.  In the worst case, a filesystem can always simply
defer marking the buffer as dirty until after the locking window has
passed, so there's obviously no fundamental problem with having a
blocking mark_buffer_dirty.  If we want a non-blocking version too, with
the requirement that the filesystem then do a manual rebalance once it
is safe to do so, that will work fine too.

--Stephen
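
The two variants discussed above can be sketched as a small userspace model.
This is only an illustration, not the kernel's code: every name below
(model_buffer, model_mark_dirty, model_mark_dirty_nonblocking,
model_balance_dirty, MODEL_DIRTY_LIMIT) is hypothetical, and "blocking" is
modelled as a synchronous flush rather than a real sleep.

```c
/* Userspace model only; these are NOT the kernel's definitions. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_DIRTY_LIMIT 4     /* hypothetical threshold before a flush */
#define MODEL_MAX_BUFS    16    /* capacity of the model's dirty list */

struct model_buffer { bool dirty; };

static struct model_buffer *dirty_list[MODEL_MAX_BUFS];
static int nr_dirty;            /* buffers currently on the dirty list */
static int nr_flushes;          /* how many times the model "blocked" */

/* Non-blocking variant: only records the buffer as dirty, so it would be
 * safe to call while holding a lock that must not sleep. */
static void model_mark_dirty_nonblocking(struct model_buffer *bh)
{
    if (bh->dirty)
        return;
    assert(nr_dirty < MODEL_MAX_BUFS);
    bh->dirty = true;
    dirty_list[nr_dirty++] = bh;
}

/* Stand-in for the rebalance step: the real one may sleep, so the caller
 * must invoke it only once the locking window has passed. */
static void model_balance_dirty(void)
{
    if (nr_dirty <= MODEL_DIRTY_LIMIT)
        return;
    for (int i = 0; i < nr_dirty; i++)
        dirty_list[i]->dirty = false;   /* pretend write-out completed */
    nr_dirty = 0;
    nr_flushes++;
}

/* Blocking variant: mark, then rebalance immediately. */
static void model_mark_dirty(struct model_buffer *bh)
{
    model_mark_dirty_nonblocking(bh);
    model_balance_dirty();
}
```

In this model a filesystem inside a locked region would call
model_mark_dirty_nonblocking() and then call model_balance_dirty() itself
after dropping the lock, which is the "manual rebalance" mentioned above.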



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my

2000-01-07 Thread Stephen C. Tweedie

Hi,

On Thu, 6 Jan 2000 20:25:38 -0500 (EST), "Albert D. Cahalan"
<[EMAIL PROTECTED]> said:

> AIX has such an API already. It is good to clone if you can.

The AIX API is much more than a simple small-operation atomic
transaction API, isn't it?  The filesystem transactions have many
properties --- no abort, predictable size, short duration --- which make
a journaling engine inappropriate for use in a general purpose
user-visible transaction API.

--Stephen



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my

2000-01-06 Thread Albert D. Cahalan

Hans Reiser writes:

> Yes, but not before 2.5.  Chris and I have already discussed that
> it would be nice to make the transaction API available to user space,
> but we haven't done any work on it, or even specified the user API.

AIX has such an API already. It is good to clone if you can.

This ought to contain the API, but might require some digging:
http://www.rs6000.ibm.com/support/



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-06 Thread Hans Reiser

Andrea Arcangeli wrote:

> BTW, I thought Hans was talking about places that can't sleep (because of
> some not schedule-aware lock) when he said "place that cannot call
> balance_dirty()".

You were correct.  I think Stephen and I are miscommunicating here.







Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-06 Thread Andrea Arcangeli

BTW, I thought Hans was talking about places that can't sleep (because of
some not schedule-aware lock) when he said "place that cannot call
balance_dirty()".

On Thu, 6 Jan 2000, Stephen C. Tweedie wrote:

>It shouldn't be impossible: as long as we are protected against
>recursive invocations of balance_dirty (which should be easy to

I am not sure I understand correctly. In case the ll_rw_block layer
produces dirty buffers, we are protected by wakeup_bdflush, which becomes a
no-op when called again from kflushd (wakeup_bdflush is non-blocking, to avoid
bdflush waiting on bdflush :). And in general balance_dirty should never
recurse on the same stack.

>arrange) we should be safe enough, at least if the memory reservation
>bits of the VM/fs interaction are working so that the balance_dirty
>can guarantee to run to completion.

Hmm maybe you are talking about something else...

Andrea




Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-06 Thread Stephen C. Tweedie

Hi,

On Thu, 23 Dec 1999 06:41:44 +0800, Tan Pong Heng
<[EMAIL PROTECTED]> said:

> I was thinking that, unless you want to have an FS-specific buffer/page
> cache, there is always a gain from a unified cache for all filesystems. I
> think the one piece of functionality missing from the 2.3 implementation
> is the dependency between the various pages. If you could specify a
> tree of relations between the various subsets of the buffer/page cache,
> and the reclaim mechanism honored that, everything should be fine.
> Filesystems that do not care about ordering could simply ignore this
> capability, and the mechanism could assume that everything is in one
> big set and can be reclaimed in any order.

That just doesn't give you enough power.  The trouble is that there
are IO dependencies which you don't know about until after the first
IO has completed.  For example, in journaling you may be allocating
journal blocks on demand, and you don't know where the journal commit
block will be until you have written most of the rest of the
transaction out.  If you are doing deferred allocation of disk blocks,
then you can't even _start_ the dependent IO trail until you
explicitly tell the filesystem that the flush-to-disk is beginning.

You need a way to let the filesystem know that you want something in
the cache to be written to disk.  You don't want to presume that one
general-purpose ordering mechanism will work.

--Stephen
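
One way to picture the point above is a cache that merely asks the owning
filesystem to write an entry out, leaving the ordering to the filesystem.
The sketch below is a userspace model under that assumption; cache_entry,
cache_request_flush, model_journal and journal_flush are all hypothetical
names, not an existing kernel interface.

```c
/* Userspace model only; no real kernel interface is implied. */
#include <assert.h>
#include <stddef.h>

struct cache_entry;
typedef void (*flush_fn)(struct cache_entry *e);

struct cache_entry {
    flush_fn flush;       /* supplied by the owning filesystem */
    void *fs_private;     /* filesystem-specific state */
};

/* Generic cache side: request write-out without imposing any ordering. */
static void cache_request_flush(struct cache_entry *e)
{
    if (e->flush != NULL)
        e->flush(e);
}

/* Hypothetical journaling filesystem that allocates journal blocks on
 * demand, so the commit block location is unknown until the rest of the
 * transaction has been written. */
struct model_journal {
    int next_free_block;  /* next journal block to hand out */
    int data_blocks;      /* blocks in the current transaction */
    int blocks_written;   /* blocks written so far */
    int commit_block;     /* -1 until the commit block is placed */
};

static void journal_flush(struct cache_entry *e)
{
    struct model_journal *j = e->fs_private;

    for (int i = 0; i < j->data_blocks; i++) {
        j->next_free_block++;         /* allocate on demand */
        j->blocks_written++;
    }
    /* Only now does the filesystem know where the commit block goes. */
    j->commit_block = j->next_free_block++;
}
```

The generic side never sees the dependency between data blocks and the
commit block; it only triggers the flush, and the filesystem decides the
order once it knows it.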



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-06 Thread Stephen C. Tweedie

Hi,

On Thu, 23 Dec 1999 02:37:48 +0300, Hans Reiser <[EMAIL PROTECTED]>
said:

>> > I completely agree to change mark_buffer_dirty() to call balance_dirty()
>> > before returning.

> How can we use a mark_buffer_dirty that calls balance_dirty in a
> place where we cannot call balance_dirty?

It shouldn't be impossible: as long as we are protected against
recursive invocations of balance_dirty (which should be easy to
arrange) we should be safe enough, at least if the memory reservation
bits of the VM/fs interaction are working so that the balance_dirty
can guarantee to run to completion.

--Stephen
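
The protection against recursive invocation mentioned above could look
roughly like the following userspace sketch. It is an assumption-laden
model: in_balance would have to be per-task state in a real kernel, and
model_balance_dirty and model_writeout are hypothetical names.

```c
/* Userspace model only; a single global flag stands in for per-task state. */
#include <assert.h>
#include <stdbool.h>

static bool in_balance;        /* true while a rebalance is in progress */
static int flush_runs;         /* completed outer rebalance passes */
static int skipped_reentries;  /* nested calls turned into no-ops */

static void model_balance_dirty(void);

/* Hypothetical write-out step that dirties another buffer on the way and
 * therefore tries to rebalance again from inside the rebalance. */
static void model_writeout(void)
{
    model_balance_dirty();
}

static void model_balance_dirty(void)
{
    if (in_balance) {          /* already rebalancing on this stack */
        skipped_reentries++;
        return;
    }
    in_balance = true;
    flush_runs++;
    model_writeout();          /* may re-enter; the guard stops it */
    in_balance = false;
}
```

The guard makes the nested call return immediately, so the outer pass can
run to completion, provided (as noted above) that it has the memory it
needs to do so.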



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)

2000-01-06 Thread Hans Reiser

Tigran Aivazian wrote:

> On Wed, 5 Jan 2000, Peter J. Braam wrote:
> > I think I mean joining.  What I need is:
> >
> >  braam starts trans
> >    does A
> >    calls reiser: hans starts
> >    does B
> >    hans commits; nothing goes to disk yet
> >    braam does C
> >  braam commits/aborts ABC now go or don't
>
> no, that definitely looks like nesting to me.
>
> Tigran.

It looks like joining to me.  If it were nesting, you would be able to commit A
without committing B.

Of course, if there is database literature defining nesting, and there probably
is, then I should be ignored here.
Perhaps the literature defines nesting as equivalent to what I call joining.

Hans
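
The "joining" behaviour described in the quoted scenario can be modelled
with a simple depth counter: an inner start joins the transaction that is
already running, an inner commit writes nothing, and only the outermost
commit or abort decides the fate of A, B and C together. This is a
userspace sketch with hypothetical names (model_trans, trans_start,
trans_do, trans_end); it does not describe the ReiserFS or Ext3 journal
code.

```c
/* Userspace model only; names and semantics are illustrative assumptions. */
#include <assert.h>
#include <stdbool.h>

struct model_trans {
    int depth;          /* starts that have not yet been ended */
    int pending_ops;    /* operations logged but not yet on disk */
    int ops_on_disk;    /* operations made durable by an outer commit */
    bool aborted;       /* set if any participant asked to abort */
};

/* Starting inside a running transaction joins it. */
static void trans_start(struct model_trans *t)
{
    t->depth++;
}

/* Log one operation (A, B or C in the quoted scenario). */
static void trans_do(struct model_trans *t)
{
    assert(t->depth > 0);
    t->pending_ops++;
}

/* End one level; only the outermost end touches the disk. */
static void trans_end(struct model_trans *t, bool abort)
{
    assert(t->depth > 0);
    if (abort)
        t->aborted = true;
    if (--t->depth > 0)
        return;                       /* inner end: nothing goes to disk */
    if (!t->aborted)
        t->ops_on_disk += t->pending_ops;
    t->pending_ops = 0;               /* all go, or none do */
    t->aborted = false;
}
```

With true nesting, by contrast, an inner transaction could be committed or
rolled back on its own, which this model deliberately does not allow.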






Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)

2000-01-06 Thread Hans Reiser

Yes, but not before 2.5.  Chris and I have already discussed that it would be
nice to make the transaction API available to user space, but we haven't done any
work on it, or even specified the user API.  We probably won't even start work on
it for 6 months (unless a sponsor asks for it).  We do think it is a good idea.

Hans

"Peter J. Braam" wrote:

> I think I mean joining.  What I need is:
>
>  braam starts trans
>    does A
>    calls reiser: hans starts
>    does B
>    hans commits; nothing goes to disk yet
>    braam does C
>  braam commits/aborts ABC now go or don't
>
> - Peter -
>
> On Wed, 5 Jan 2000, Hans Reiser wrote:
>
> > Is nesting really the term you mean to use here, or is joining the term you
> > mean?
> >
> > Do you really mean transactions within other transactions?
> >
> > Exactly what functionality do you need?
> >
> > Hans
> >
> > "Peter J. Braam" wrote:
> >
> > > Hi,
> > >
> > > I have one request for the journal API for use by network file systems -
> > > it is a request of a slightly different nature than the ones discussed so
> > > far.
> > >
> > > InterMezzo (www.inter-mezzo.org) exploits an existing disk file system as
> > > a cache and wraps around it. (Any disk file system can be used, but so far
> > > only Ext2 has been exploited.)  High availability file systems need update
> > > logs of changes that were made to the cache so that these may be
> > > propagated to peers when they come back online (to support "disconnected
> > > operation").
> > >
> > > Requested feature:
> > > 
> > >
> > > Stephen's journal API has a tremendously useful feature: it allows nesting
> > > of transactions.   I don't know if Reiser has this (can you tell me
> > > Chris?) but it is _incredibly_ useful.  So:
> > >
> > > - InterMezzo can start a journal transaction
> > >   - execute the underlying Ext3 routine within that transaction
> > >     (i.e. the Ext3 transaction becomes part of the one started
> > >     by InterMezzo)
> > > - InterMezzo finishes its routine (e.g. by noting that an update
> > >   took place in its update log) and commits or aborts the transaction
> > >
> > > -
> > >
> > > [So, in particular InterMezzo and Ext3 share the journal transaction log.]
> > >
> > > Why is this useful? There are at least two reasons:
> > >
> > >  - the InterMezzo update log can be kept in sync with the Ext3 file
> > > system as a cache
> > >
> > >  - InterMezzo will soon manage somewhat more metadata (e.g. it may want to
> > > remember a global file identifier, similar to a Coda FID or NFS file
> > > handle) and it can make updates to its metadata atomically with updates
> > > made to Ext3 metadata.
> > >
> > > Both of these reasons touch the core architectural decisions of systems
> > > like Coda/AFS/InterMezzo/DCE-DFS -- so there is some historical reason to
> > > be so delighted with what one can do with Stephen's API.
> > >
> > > Presently, systems like Coda and AFS have a hell of a time keeping caches
> > > in sync with the metadata and to a large extent Coda's really bad
> > > performance is caused by this (an external transaction system is used in
> > > conjunction with synchronous operations on the disk file system, ouch...).
> > > InterMezzo will start using the kernel journal facility that should be
> > > much lighter weight.
> > >
> > > Is this a reasonable thing to ask for?
> > >
> > > - Peter -






Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)

2000-01-05 Thread Chris Mason



On Wed, 5 Jan 2000, Peter J. Braam wrote:

> I think I mean joining.  What I need is:
>   
>  braam starts trans
>    does A
>    calls reiser: hans starts
>    does B
>    hans commits; nothing goes to disk yet
>    braam does C
>  braam commits/aborts ABC now go or don't
> 
> 
Reiserfs won't do this kind of nesting right now, we also don't have a
transaction abort (aside from crashing the machine).  These can be added
to a future version, but would you mind explaining your transaction needs
in more detail (offline) so I can get a better idea of what you are
looking for?

-chris

> - Peter -
> 
> On Wed, 5 Jan 2000, Hans Reiser wrote:
> 
> > Is nesting really the term you mean to use here, or is joining the term you
> > mean?
> > 
> > Do you really mean transactions within other transactions?
> > 
> > Exactly what functionality do you need?
> > 
> > Hans



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)

2000-01-05 Thread Tigran Aivazian

On Wed, 5 Jan 2000, Peter J. Braam wrote:
> I think I mean joining.  What I need is:
>   
>  braam starts trans
>    does A
>    calls reiser: hans starts
>    does B
>    hans commits; nothing goes to disk yet
>    braam does C
>  braam commits/aborts ABC now go or don't

no, that definitely looks like nesting to me.

Tigran.



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)

2000-01-05 Thread Peter J. Braam

I think I mean joining.  What I need is:
  
 braam starts trans
   does A
   calls reiser: hans starts
   does B
   hans commits; nothing goes to disk yet
   braam does C
braam commits/aborts ABC now go or don't


- Peter -

On Wed, 5 Jan 2000, Hans Reiser wrote:

> Is nesting really the term you mean to use here, or is joining the term you
> mean?
> 
> Do you really mean transactions within other transactions?
> 
> Exactly what functionality do you need?
> 
> Hans



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending old email lost by (former) ISP)

2000-01-05 Thread Hans Reiser

Erez Zadok wrote:

>
> Hans and linux-fsdevel folks: I have a proposal.  How would you all feel
> about forming an informal group that would report changes relevant to f/s
> developers on this list.  (Maybe even a different mailing list?)

I think that sending emails to this mailing list summarizing the kernel
changes that other FS developers need to know about is a reasonable idea.
Of course, I also like comments in the code explaining the concepts each
function embodies, so I know that I am on the fringe.

Hans





Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)

2000-01-05 Thread Hans Reiser

Is nesting really the term you mean to use here, or is joining the term you
mean?

Do you really mean transactions within other transactions?

Exactly what functionality do you need?

Hans

"Peter J. Braam" wrote:

> Hi,
>
> I have one request for the journal API for use by network file systems -
> it is a request of a slightly different nature than the ones discussed so
> far.
>
> InterMezzo (www.inter-mezzo.org) exploits an existing disk file system as
> a cache and wraps around it. (Any disk file system can be used, but so far
> only Ext2 has been exploited.)  High availability file systems need update
> logs of changes that were made to the cache so that these may be
> propagated to peers when they come back online (to support "disconnected
> operation").
>
> Requested feature:
> 
>
> Stephen's journal API has a tremendously useful feature: it allows nesting
> of transactions.   I don't know if Reiser has this (can you tell me
> Chris?) but it is _incredibly_ useful.  So:
>
> - InterMezzo can start a journal transaction
>   - execute the underlying Ext3 routine within that transaction
>     (i.e. the Ext3 transaction becomes part of the one started
>     by InterMezzo)
> - InterMezzo finishes its routine (e.g. by noting that an update
>   took place in its update log) and commits or aborts the transaction
>
> -
>
> [So, in particular InterMezzo and Ext3 share the journal transaction log.]
>
> Why is this useful? There are at least two reasons:
>
>  - the InterMezzo update log can be kept in sync with the Ext3 file
> system as a cache
>
>  - InterMezzo will soon manage somewhat more metadata (e.g. it may want to
> remember a global file identifier, similar to a Coda FID or NFS file
> handle) and it can make updates to its metadata atomically with updates
> made to Ext3 metadata.
>
> Both of these reasons touch the core architectural decisions of systems
> like Coda/AFS/InterMezzo/DCE-DFS -- so there is some historical reason to
> be so delighted with what one can do with Stephen's API.
>
> Presently, systems like Coda and AFS have a hell of a time keeping caches
> in sync with the metadata and to a large extent Coda's really bad
> performance is caused by this (an external transaction system is used in
> conjunction with synchronous operations on the disk file system, ouch...).
> InterMezzo will start using the kernel journal facility that should be
> much lighter weight.
>
> Is this a reasonable thing to ask for?
>
> - Peter -




Re: kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)

1999-12-26 Thread Tigran Aivazian

Hi guys,

Although I received a few nice replies privately, I think it makes sense
to clarify what I meant when I said "Jeff's comments irritate me".

I meant that we should put a lot more work into writing kernel
commentaries like the excellent one started by Neil Brown on VFS and nfsd
already (don't have URL handy). I would love to see something of the size
of Encyclopaedia Britannica that documents every single line of kernel
code, including drivers, and is always up to date despite those lines
changing every second. A titanic job, isn't it? But accepting titanic jobs is
more honourable (at least in my eyes) than saying "grepping patches ain't
difficult".

I agree with Jeff that notifying individuals is impossible but helping to
disseminate the knowledge by means of kernel internals docs is the way to
go.

Regards,
Tigran.

On Sun, 26 Dec 1999, Erez Zadok wrote:

> In message <[EMAIL PROTECTED]>, 
>Jeff Garzik writes:
> > 
> [...]
> > To sum, documenting changes is a very good idea, notifying specific
> > hackers of specific kernel changes is a waste of time [unless they
> > are the maintainers of the code being changed, of course].
> 
> I agree that notifying individuals doesn't scale.  Notifying the list as a
> whole, does.
> 
> > Jeff
> 
> Erez.
> 



Re: kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)

1999-12-26 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, 
Jeff Garzik writes:
> 
[...]
> To sum, documenting changes is a very good idea, notifying specific
> hackers of specific kernel changes is a waste of time [unless they
> are the maintainers of the code being changed, of course].

I agree that notifying individuals doesn't scale.  Notifying the list as a
whole, does.

>   Jeff

Erez.



kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)

1999-12-26 Thread Jeff Garzik


(copied to linux-kernel)

On Sun, 26 Dec 1999, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, 
>Jeff Garzik writes:
> > On Thu, 23 Dec 1999, Hans Reiser wrote:
> > > All I'm going to ask is that if mark_buffer_dirty gets changed again,
> > > whoever changes it please let us know this time.  The last two times
> > > it was changed we weren't informed, and the first time it happened it
> > > took a long time to figure it out.

> > Can't you figure this sort of thing out on your own?  Generally if you
> > want to stay updated on something, you are the one who needs to do the
> > legwork.  And grep'ing patches ain't that hard

> Jeff, Hans is absolutely right.

So, you are accepting the job of notifying Hans each time
mark_buffer_dirty changes?   ;-)

Hans is not right, because the request does not scale.  I would love to
be notified whenever drivers/video changes, for example, but I'm sure
Geert and Linus have better things to do with their time.

A small Perl script using ctags and grep (or other means) can get
you a list of functions changed in each release.


> In my case (stackable f/s), every time there's a change to
> anything under linux/fs, linux/mm, or headers, I've got to find out what
> changed and how it affected my code.

Any change of the kernel core requires analysis and testing in order
to determine the effects on other code.  E-mail notification doesn't
change that.


> There is no ChangeLog[...]

> Hans and linux-fsdevel folks: I have a proposal.  How would you all feel
> about forming an informal group that would report changes relevant to f/s
> developers on this list.  (Maybe even a different mailing list?)  I'm
[...]
> Comments?

In the past, I have publicly and privately argued for maintained
ChangeLogs in the kernel.  There are so many advantages, especially
when hacking up old unmaintained code.  There have been several cases
of hackers duplicating old (but buggy) submissions, which could have
been avoided had they read a well-maintained ChangeLog.  Those who
ignore (or are ignorant of, in this case) history are doomed to
repeat it. :)

Linux is getting enough attention and eyes that I think ChangeLogs
would be of immense value.  Many people read and learn from the
kernel code -- and even more knowledge can be gleaned from reading
ChangeLogs sometimes.  But none of the people who write most of the
code indicated any interest.  The gcc project requires a ChangeLog entry
with each submission, something I would _love_ to see.  But that
requires Linus's intervention.  And that requires convincing Linus,
Alan, DaveM, Al Viro, and other submitters of large patches to agree
to write ChangeLog entries.

Without such a requirement, partially maintained ChangeLogs have
even less advantage over no ChangeLogs at all -- in this case, no
docs would be better than wrong docs IMNSHO.

As to your suggestion, a group of people posting VFS changelogs -- more
power to you!  It's better than nothing.  Just make sure such ChangeLogs
are actively maintained, if they ever make it off the mailing list and
into the kernel sources.

To sum, documenting changes is a very good idea, notifying specific
hackers of specific kernel changes is a waste of time [unless they
are the maintainers of the code being changed, of course].

Jeff






Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-26 Thread feiliu

May I ask why the time is O(N*Log(N)) instead of O(Log(N))? We have this
interesting OS class implementing an AVL-tree-structured directory entry in
the ext2 directory file on disk. I always think it is not going to work out.
But the TA and the professor keep telling me the new file system will be
better than ext2 because now we have O(Log(N)) time search(ok),
insert/removal(???). I really doubt it but I do not know where they can be
wrong.
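
For reference on the complexity question, a single lookup in a balanced
binary tree needs a number of comparisons that grows with log2(N); red-black
and AVL trees keep their height within a small constant factor of that. The
sketch below models an ideally balanced tree by halving the remaining key
count at each step. The function name lookup_steps is made up for
illustration, and the reading that O(N*log(N)) would describe N such
operations in total is an assumption, not something stated in the thread.

```c
/* Illustrative model: comparisons for one lookup among n keys when each
 * comparison discards half of the remaining candidates. */
#include <assert.h>

static unsigned lookup_steps(unsigned long n)
{
    unsigned steps = 0;

    while (n > 0) {
        steps++;        /* one comparison at the current node */
        n /= 2;         /* descend into one half of the subtree */
    }
    return steps;       /* floor(log2(n)) + 1 for n >= 1, 0 for n == 0 */
}
```

Under this model, going from about a thousand keys to about a million only
doubles the per-lookup comparison count, which is the usual argument for
tree-structured directories over a linear scan.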

Besides, how can one join this [EMAIL PROTECTED] email list?
I did not find instructions for doing so anywhere.
Fei


 *~+_+~~~*
 *  Email:[EMAIL PROTECTED] | WWW:   http://aa.eps.jhu.edu/~feiliu*
  *  (410)889-9876(H)  | Johns Hopkins Univ. | (410)516-7047(O) *
   *---+_+-*

On Thu, 23 Dec 1999, Andrea Arcangeli wrote:

> On Wed, 22 Dec 1999, William J. Earl wrote:
> 
> >in the extent.  If the page cache were indexed by a per-inode AVL tree
> 
> Some months ago I did some research into putting the pagecache into a
> per-inode RB-tree. AVL would be overkill because insert/removal can be the
> only operation done on the tree (with cache pollution going on).
> 
> Unfortunately if the inode size gets very large the RB-tree won't scale
> :(. With a hash you can say "ok, enlarge the hash 200mbyte and get rid of
> the complexity paying with memory", while with an rbtree you have to
> always pay O(N*log(N)) for each query/insert/removal... Chuck's bench
> generated nice numbers with the pagecache in the per-inode RB though
> (without considering your "ordering" needs of course).
> 
> The interesting code should be here (or nearby, just search for the
> filename in the ftp area if it's not exactly there):
> 
>   ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.6_andrea5.bz2
> 
> Andrea
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [EMAIL PROTECTED]  For more info on Linux MM,
> see: http://www.nl.linux.org/Linux-MM/
> 



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-26 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, 
Jeff Garzik writes:
> On Thu, 23 Dec 1999, Hans Reiser wrote:

> > All I'm going to ask is that if mark_buffer_dirty gets changed again,
> > whoever changes it please let us know this time.  The last two times
> > it was changed we weren't informed, and the first time it happened it
> > took a long time to figure it out.
>
> Can't you figure this sort of thing out on your own?  Generally if you
> want to stay updated on something, you are the one who needs to do the
> legwork.  And grep'ing patches ain't that hard
> 
>   Jeff

Jeff, Hans is absolutely right.

We can all figure it out on our own, and waste many hours re-discovering
that which others have discovered independently.  It's a royal pain and time
sink.  I'd rather write new code than try to figure out what's changed b/t
kernel versions.  In my case (stackable f/s), every time there's a change to
anything under linux/fs, linux/mm, or headers, I've got to find out what
changed and how it affected my code.  It's NOT enough to grep the patches.
Union diffs don't give you enough of a context of difference that's
meaningful to understanding the overall changes that were made.  I have to
use emacs's ediff or other methods to find out the meaning and motivation
behind the change.

There is no NEWS file for each release.

There is no ChangeLog for each release.  Actually there are a few ChangeLog
files sprinkled around the sources.  The last time linux-2.3.25/fs/ChangeLog
was updated was 1998.

There is no one who summarizes kernel changes.  A long time ago, someone
used to.  I don't remember his name.  Is he still doing that?

I maintain a much smaller package (am-utils) and there's no way I could
remember what changes I've made throughout the years.  That's why I keep
detailed ChangeLog and NEWS files w/ my releases.  I realize the linux kernel
is a much bigger and more complex beast, but shouldn't that be a bigger
motivation for everyone to keep ChangeLogs?  IMHO, if we want to speed linux
development along, we should help the documentation of linux.


Hans and linux-fsdevel folks: I have a proposal.  How would you all feel
about forming an informal group that would report changes relevant to f/s
developers on this list?  (Maybe even a different mailing list?)  I'm
willing to take the time to report whatever VFS changes I find each time I
update my stackable f/s code for a new kernel, including when no relevant
changes are made (which IMHO is just as important).  This effort would help
all of us f/s developers, but only if we each take the time to report our
findings to this list.  The few minutes each person takes to report their
findings as they relate to their f/s, will save numerous other people many
hours; overall this would help everyone.  We can also make it easy to find
these messages in the archives, so we can make the Subject of such messages
a grep-able format---say,

CHANGE 2.3.17-2.3.18: vm_area_struct->vm_pte renamed vm_private_data


Comments?

Erez.



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-24 Thread afei

May I ask why the time is O(N*Log(N)) instead of O(Log(N))? We have this
interesting OS class implementing an AVL-tree-structured directory entry in
the ext2 directory file on disk. I always think it is not going to work out.
But the TA and the professor keep telling me the new file system will be
better than ext2 because now we have O(Log(N)) time search (ok),
insert/removal (???). I really doubt it but I do not know where they can be
wrong.

Fei

 On Thu, 23 Dec 1999, Andrea Arcangeli wrote:

> On Wed, 22 Dec 1999, William J. Earl wrote:
> 
> >in the extent.  If the page cache were indexed by a per-inode AVL tree
> 
> Some months ago I did some research into putting the pagecache into a
> per-inode RB-tree. AVL would be overkill because insert/removal can be the
> only operation done on the tree (with cache pollution going on).
> 
> Unfortunately if the inode size gets very large the RB-tree won't scale
> :(. With a hash you can say "ok, enlarge the hash 200mbyte and get rid of
> the complexity paying with memory", while with an rbtree you have to
> always pay O(N*log(N)) for each query/insert/removal... Chuck's  bench
> generated nice numbers with the pagecache in the per-inode RB though
> (without considering your "ordering" needs of course).
> 
> The interesting code should be here (or nearby, just search for the
> filename in the ftp area if it's not exactly there):
> 
>   ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.6_andrea5.bz2
> 
> Andrea
> 
> 



[OT] Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-24 Thread Tigran Aivazian

On Thu, 23 Dec 1999, Jeff Garzik wrote:
> Can't you figure this sort of thing out on your own?
..
> And grep'ing patches ain't that hard

Jeff, with all respect to your great kernel hacking talents - these sort
of comments really irritate me. Most (I assume) kernel hackers have
full-time jobs which have nothing to do with Linux - only few chosen are
lucky to work fulltime on Linux kernel hacking. I spend 0 minutes a day on
Linux and still attempt to contribute something useful (see my patches
site, most of which are obsoleted by acceptance).

Grepping patches is not hard unless one does it after one comes back home
tired in the evening...

So, the moral of the story is - Hans is right.

Regards,
--
Tigran A. Aivazian   | http://www.sco.com
Escalations Research Group   | tel: +44-(0)1923-813796
Santa Cruz Operation Ltd | http://www.ocston.org/~tigran



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-23 Thread Jeff Garzik

On Thu, 23 Dec 1999, Hans Reiser wrote:
> All I'm going to ask is that if mark_buffer_dirty gets changed again, whoever
> changes it please let us know this time.  The last two times it was changed
> we weren't informed, and the first time it happened it took a long time to
> figure it out.

Can't you figure this sort of thing out on your own?  Generally if you
want to stay updated on something, you are the one who needs to do the
legwork.  And grep'ing patches ain't that hard

Jeff






Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-23 Thread Hans Reiser

All I'm going to ask is that if mark_buffer_dirty gets changed again, whoever
changes it please let us know this time.  The last two times it was changed
we weren't informed, and the first time it happened it took a long time to
figure it out.

I think that whether we make __mark_buffer_dirty or mark_buffer_dirty schedule
free is an argument over whether to name a function half-full or half-empty.  I
yield to both sides.

Hans

Andrea Arcangeli wrote:

> On Thu, 23 Dec 1999, Hans Reiser wrote:
>
> >If reiserfs had good SMP, you could stall it anywhere, and the code
> >could handle that.  But we don't, and I bet others also don't, and we
> >won't have it for some time even though we are working on it.
>
> I completely understand that we need also an atomic mark_buffer_dirty and
> to call buffer_dirty from some other place.
>
> But IMHO there's no one good reason to break all the old rock solid
> filesystems like ext2 just because there's the need of a new feature.
>
> I am not proposing to not provide a way to atomically mark a buffer
> dirty. I propose only to not change the semantics of the function called
> `mark_buffer_dirty()' as it happened now.
>
> If you want the atomic version just call __mark_buffer_dirty() and use
> balance_dirty() by hand as soon as you can (after releasing your SMP
> locks).
>
> We can trivially replace mark_buffer_dirty() with __mark_buffer_dirty()
> with an automated script inside smart/SMP filesystems that wants to
> continue to use the current 2.3.x semantic of mark_buffer_dirty().
>
> Andrea






Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-23 Thread Andrea Arcangeli

On Thu, 23 Dec 1999, Hans Reiser wrote:

>If reiserfs had good SMP, you could stall it anywhere, and the code
>could handle that.  But we don't, and I bet others also don't, and we
>won't have it for some time even though we are working on it.

I completely understand that we need also an atomic mark_buffer_dirty and
to call buffer_dirty from some other place.

But IMHO there's no one good reason to break all the old rock solid
filesystems like ext2 just because there's the need of a new feature.

I am not proposing to not provide a way to atomically mark a buffer
dirty. I propose only to not change the semantics of the function called
`mark_buffer_dirty()' as it happened now.

If you want the atomic version just call __mark_buffer_dirty() and use
balance_dirty() by hand as soon as you can (after releasing your SMP
locks).

We can trivially replace mark_buffer_dirty() with __mark_buffer_dirty()
with an automated script inside smart/SMP filesystems that wants to
continue to use the current 2.3.x semantic of mark_buffer_dirty().

Andrea
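
The pattern described above (mark the buffer dirty without sleeping while a
lock is held, then rebalance by hand once the lock is released) can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: struct buf, fs_lock(), mark_dirty_atomic() and
balance_dirty_stub() are hypothetical stand-ins, not the kernel's
__mark_buffer_dirty() or balance_dirty().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer head; only the dirty flag matters here. */
struct buf {
    bool dirty;
};

static bool lock_held;      /* models a lock that must not be held while sleeping */
static int nr_dirty;        /* buffers marked dirty so far */
static int balance_calls;   /* times the (potentially blocking) step ran */

static void fs_lock(void)   { assert(!lock_held); lock_held = true; }
static void fs_unlock(void) { assert(lock_held);  lock_held = false; }

/* Non-blocking variant: flips the flag and counts; never sleeps. */
static void mark_dirty_atomic(struct buf *b)
{
    if (!b->dirty) {
        b->dirty = true;
        nr_dirty++;
    }
}

/* Stand-in for the step that may sleep; it must not run under the lock. */
static void balance_dirty_stub(void)
{
    assert(!lock_held);
    balance_calls++;
}

/* Dirty several buffers under the lock, then rebalance once after unlocking. */
static void update_metadata(struct buf *bufs, int n)
{
    fs_lock();
    for (int i = 0; i < n; i++)
        mark_dirty_atomic(&bufs[i]);
    fs_unlock();

    balance_dirty_stub();   /* safe point: no lock is held here */
}
```

Calling update_metadata() on three clean buffers leaves all three dirty and
runs the rebalance step exactly once, after the lock has been dropped. The
cost Hans points out elsewhere in the thread still applies: the caller has to
remember that a rebalance is owed.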



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-23 Thread Andrea Arcangeli

On Wed, 22 Dec 1999, William J. Earl wrote:

>in the extent.  If the page cache were indexed by a per-inode AVL tree

Some months ago I did some research into putting the pagecache into a
per-inode RB-tree. AVL would be overkill because insert/removal can be the
only operation done on the tree (with cache pollution going on).

Unfortunately if the inode size gets very large the RB-tree won't scale
:(. With a hash you can say "ok, enlarge the hash 200mbyte and get rid of
the complexity paying with memory", while with an rbtree you have to
always pay O(N*log(N)) for each query/insert/removal... Chuck's  bench
generated nice numbers with the pagecache in the per-inode RB though
(without considering your "ordering" needs of course).

The interesting code should be here (or nearby, just search for the
filename in the ftp area if it's not exactly there):

ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.6_andrea5.bz2

Andrea
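
On the per-operation cost: a single lookup in a balanced tree holding N
entries needs on the order of log2(N) comparisons, which is the O(log N)
usually quoted for RB and AVL trees; N*log(N) is closer to the cost of
inserting all N entries. The sketch below counts probes for one lookup. It
uses a binary search over a sorted array of page offsets as a stand-in for
descending a balanced tree, so find_page() and its parameters are
illustrative assumptions, not code from the patch referenced above.

```c
#include <assert.h>
#include <stddef.h>

/* Binary search over a sorted array of page offsets: a stand-in for
 * descending a balanced tree. *steps receives the number of probes made. */
static long find_page(const unsigned long *offsets, size_t n,
                      unsigned long key, unsigned *steps)
{
    size_t lo = 0, hi = n;   /* search window is [lo, hi) */

    assert(steps != NULL);
    *steps = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        (*steps)++;
        if (offsets[mid] == key)
            return (long)mid;
        if (offsets[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;               /* not present */
}
```

With 1024 entries no lookup needs more than 11 probes. A well-sized hash
would typically need about one, which is the memory-for-time trade described
above.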



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-23 Thread Hans Reiser

"Benjamin C.R. LaHaise" wrote:

> > I completely agree to change mark_buffer_dirty() to call balance_dirty()
> > before returning. But if you add the balance_dirty() calls all over the
> > right places all should be _just_ fine as far as I can tell.
>
> I don't agree, both for the reasons above and because doing a
> balance_dirty in mark_buffer_dirty tends to result in stalls in the
> *wrong* place, because it tends to stall in the middle of an operation,
> not before it has begun.  You end up stalling on metadata operations that
> shouldn't stall.  The stall thresholds for data vs metadata have to be
> different in order to make the system 'feel' right.  This is easily
> accomplished by trying to "allocate" the dirty buffers before you actually
> dirty them (by checking if there's enough slack in the dirty buffer
> margins before doing the operation).
>
> -ben

If reiserfs had good SMP, you could stall it anywhere, and the code could handle
that.  But we don't, and I bet others also don't, and we won't have it for some
time even though we are working on it.

Hans






Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-23 Thread Hans Reiser

Stephen's remarks seem right to me.

Hans







Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-23 Thread William J. Earl

Tan Pong Heng writes:
...
 > I was thinking that, unless you want to have FS specific buffer/page cache,
 > there is always a gain for a unified cache for all fs. I think the one piece
 > of functionality missing from the 2.3 implementation is the dependency
 > between the various pages. If you could specify a tree of relations between
 > the various subsets of the buffers/pages and the reclaim mechanism honored
 > that, everything should be fine. For FS that do not care about ordering,
 > they could simply ignore this capability and the mechanism could assume
 > that everything is in one big set and could be reclaimed in any order.
...

  For the XFS port, we have been working on this, since XFS very much
wants to cluster logically adjacent delayed-allocation (and delayed-write) pages
together to optimize writes.  That is, if someone who wants to write
back a dirty page to disk asks the file system to do so, then the file
system wants to find all nearby pages (nearby in the file, not necessarily
in memory).   The file system looks up the extent in which the page resides,
or allocates an extent if the page is part of a delayed allocation, and
then writes all of the pages in the extent at once.  Given the present
data structures, this is done by probing the page cache for each page
in the extent.  If the page cache were indexed by a per-inode AVL tree
(or other ordered index), then collecting adjacent pages would be cheaper.
Compared to a disk I/O, hash table probes are still relatively low in cost,
but it would be possible to do a bit better with some ordered index.
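
To illustrate why an ordered index makes this cheaper: with the cached page
indices of one file kept in sorted order, all pages inside an extent can be
found with one search for the first page followed by a linear walk, instead
of one probe per page in the extent. The sketch below uses a sorted array in
place of a tree; lower_bound() and collect_extent() are hypothetical names,
and nothing here is taken from the XFS port.

```c
#include <assert.h>
#include <stddef.h>

/* First position whose page index is >= key (classic lower bound). */
static size_t lower_bound(const unsigned long *idx, size_t n, unsigned long key)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Collect cached pages inside the extent [start, start + len).
 * idx must be sorted ascending. Returns how many entries were written to out. */
static size_t collect_extent(const unsigned long *idx, size_t n,
                             unsigned long start, unsigned long len,
                             unsigned long *out, size_t out_cap)
{
    size_t found = 0;

    assert(idx != NULL && out != NULL);
    for (size_t i = lower_bound(idx, n, start);
         i < n && idx[i] < start + len && found < out_cap; i++)
        out[found++] = idx[i];
    return found;
}
```

For cached pages {1, 2, 5, 8, 9, 10, 40} and an extent covering pages 8 to
15, this returns pages 8, 9 and 10 after a single search, rather than eight
separate hash probes.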



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-22 Thread Hans Reiser

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Tue, 21 Dec 1999 11:18:03 +0100 (CET), Andrea Arcangeli
> <[EMAIL PROTECTED]> said:
>
> > On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:
> >> refile_buffer() checks in buffer.c.  Ideally there should be a
> >> system-wide upper bound on dirty data: if each different filesystem
> >> starts to throttle writes at 50% of physical memory then you only
> >> need two different filesystems to overcommit your memory badly.
>
> > If all FSes shares the dirty list of buffer.c that's not true.

Stephen's global counter really would make things simpler to code.  I would also
like to see each filesystem able to specify a minimum amount it wants reserved
as clean pages, and have a global minimum that is the sum of all of these
amounts for all mounted filesystems.

>
>
> The entire point of this is that Linus has refused, point blank, to
> add the complexity of journaling to the buffer cache.  The journaling
> _has_ to be done independently, so we _have_ to have the dirty data
> for journal transactions kept outside of the buffer cache.
>
> We cannot use the buffer.c dirty list anyway because bdflush can write
> those buffers to disk at any time.  Transactions have to control the
> write ordering so we can only feed those writes into the buffer queues
> under strict control when we go to commit a transaction.
>
> > All normal filesystems are using the mark_buffer_dirty() in buffer.c
>
> We're not talking about normal filesystems. :)
>
> > so currently the 40% setting of bdflush is a system-wide number and
> > not a per-fs number.
>
> For filesystems that can use that mechanism, sure.  We need to be able
> to extend that mechanism so that filesystems with other writeback
> mechanisms can use it too.
>
> > If both ext3 and reiserfs are using refile_buffer and both are using
> > balance_dirty in the right places as Linus wants, all seems just fine to
> > me.
>
> They aren't and they can't.
>
> > I completely agree to change mark_buffer_dirty() to call balance_dirty()
> > before returning.
>
> Agreed.

How can we use a mark_buffer_dirty that calls balance_dirty in a place where we
cannot call balance_dirty?

>
>
> --Stephen






Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-22 Thread Tan Pong Heng

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Tue, 21 Dec 1999 20:21:05 -0500 (EST), "Benjamin C.R. LaHaise"
> <[EMAIL PROTECTED]> said:
>
> > The buffer dirty lists are the wrong place to be dealing with this.  We
> > need a lightweight, fast way of monitoring the system's dirty buffer/page
> > thresholds -- one that can be called for every write to a page or on the
> > write faults for cow pages.
>
> Precisely.  The only thing that the core VM needs to export is an atomic
> counter for such pages, a wait queue so that processes can wait for
> pages to be cleaned, and a function to be called to try to reclaim such
> pages.
>
> Remember, though, that we have three different types of page we need to
> deal with.  There are simple used pages, which we need to reclaim in a
> component-independent manner when we are using too much memory; then
> there are dirty pages which can be flushed to disk at any time; then
> there are reserved pages which cannot be flushed to disk without some
> extra work.
>
> The first case is simple: we already have the wait queues and reclaim
> functions in place, and all we need is an address_space callback to
> allow filesystem-specific caches to return pages when shrink_mmap()
> wants them.
>
> In the second case (dirty pages), bdflush already does some of the work,
> but we need a more generic solution if we want to support dirty data
> which is not stored in buffer_heads in a portable manner.
>
> The third case (reserved pages) is the case which doesn't affect any
> current code but which will become really important for journaled or
> deferred-allocation filesystems.
>
> --Stephen

Sorry for intruding, I have been monitoring this thread with interest.

I was thinking that, unless you want to have FS specific buffer/page cache,
there is always a gain for a unified cache for all fs. I think the one piece
of functionality missing from the 2.3 implementation is the dependency
between the various pages. If you could specify a tree of relations between
the various subsets of the buffers/pages and the reclaim mechanism honored
that, everything should be fine. For FS that do not care about ordering,
they could simply ignore this capability and the mechanism could assume
that everything is in one big set and could be reclaimed in any order.

I have not given the complexity of implementing such functionality
a thought yet. But it seems to be feasible - since you would need to do that
anyway for your FS



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-21 Thread Stephen C. Tweedie

Hi,

On Tue, 21 Dec 1999 14:57:29 +0100 (CET), Andrea Arcangeli
<[EMAIL PROTECTED]> said:

> So you are talking about replacing this line:
>   dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
> with:
>   dirty = (size_buffers_type[BUF_DIRTY]+size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;

Basically yes, but I was envisaging something slightly different from
the above.

There may well be data which is simply not in the buffer cache at all
but which needs to be accounted for as pinned memory.  A good example
would be if some filesystem wants to implement deferred allocation of
disk blocks: the corresponding pages in the page cache obviously cannot
be flushed to disk without generating extra filesystem activity for the
allocation of disk blocks to pages.  The pages must therefore be pinned,
but as they don't yet have disk mappings we can't assume that they are
in the buffer cache.

So we really need a pinned page threshold which can apply to general
pages, not necessarily to the buffer cache.


There's another issue, though.  BUF_DIRTY buffers do not necessarily
count as pinned in this context: they can always be flushed to disk
without generating any significant new memory allocation pressure.  We
still need to do write-throttling, so we need a threshold on dirty data
for that reason.  However, deferred allocation and transactions actually
have a more subtle and nastier property: you cannot necessarily flush
the pages from memory without first allocating more memory.

In the transaction case this is because you have to allow transactions
which are already in progress to complete before you can commit the
transaction (you cannot commit incomplete transactions because that
would defeat the entire point of a transactional system!).  In the case
of deferred disk block allocation, the problem is that flushing the
dirty data requires extra filesystem operations as we allocate disk
blocks to pages.

In these cases we need to be able to make sure that not only does pinned
memory never exceed a threshold, we also have to ensure that the
*future* allocations required to flush the existing allocated memory can
also be satisfied.  We need to allow filesystems to "reserve" such extra
memory, and we need a system-wide threshold on all such reservations.

The ext3 journaling code already has support for reservations, but
that's currently a per-filesystem parameter.  We still have need for a
global VM reservation to prevent memory starvation if multiple different
filesystems have this behaviour.


Note that what we need here isn't complex: it's no more than exporting
atomic_t counts of the number of dirty and reserved pages in the system
and supporting a maximum threshold on these values via /proc.  The
mechanism for observing these limits can be local to each filesystem: as
long as there is an agreed counter in the VM where they can register
their use of memory.

--Stephen
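
A minimal sketch of the kind of shared accounting described above, using C11
atomics in userspace rather than the kernel's atomic_t. The counter name, the
ceiling, and the functions reserve_pages() and release_pages() are
assumptions made for illustration; no such interface is claimed to exist in
the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical system-wide accounting; the names are illustrative only. */
static atomic_long nr_reserved_pages;       /* pages reserved by all filesystems */
static long max_reserved_pages = 1024;      /* ceiling, imagined as a /proc tunable */

/* Try to reserve n pages before starting an operation.
 * Returns false if that would push the global count over the ceiling. */
static bool reserve_pages(long n)
{
    long cur = atomic_load(&nr_reserved_pages);

    assert(n >= 0);
    do {
        if (cur + n > max_reserved_pages)
            return false;   /* caller should wait or flush first */
    } while (!atomic_compare_exchange_weak(&nr_reserved_pages, &cur, cur + n));
    return true;
}

/* Give pages back once the data they covered has reached disk. */
static void release_pages(long n)
{
    long old = atomic_fetch_sub(&nr_reserved_pages, n);

    assert(old >= n);
    (void)old;
}
```

Each filesystem would call reserve_pages() before beginning an operation and
decide for itself what to do on failure, which keeps the policy local while
the count stays global. Checking before the operation starts also matches the
suggestion elsewhere in the thread to stall before an operation begins rather
than in the middle of it.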



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-21 Thread Andrea Arcangeli

On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:

>We cannot use the buffer.c dirty list anyway because bdflush can write
>those buffers to disk at any time.  Transactions have to control the

So you are talking about replacing this line:

dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;

with:

dirty = (size_buffers_type[BUF_DIRTY]+size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;

If you don't do that you don't need _two_ filesystems to generate too many
dirty buffers but you can potentially go OOM with only one journaling
filesystem running. As you talked about a _two_ filesystem case generating
dirty buffers on 100% of memory I thought you were talking about something
very different than the above one liner. If you were talking about it
that's fine and I agree of course.

>We're not talking about normal filesystems. :)

With "normal" filesystems I meant filesystems that are _using_
linux/fs/buffer.c.

Andrea
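
The effect of the one-liner above can be checked with a small standalone
sketch. The list numbering, the 4 KiB page size, and the over_limit() helper
are assumptions for illustration; only the expression inside dirty_pages()
comes from the message.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Illustrative list numbering; not the kernel's actual enum. */
enum { BUF_CLEAN, BUF_DIRTY, BUF_PINNED, NR_LIST };

/* Bytes held on each list, as in the one-liner quoted above. */
static unsigned long size_buffers_type[NR_LIST];

/* Dirty pages for throttling purposes, counting pinned buffers as dirty. */
static unsigned long dirty_pages(void)
{
    return (size_buffers_type[BUF_DIRTY] +
            size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;
}

/* Nonzero when dirty + pinned exceeds pct percent of total_pages. */
static int over_limit(unsigned long total_pages, unsigned pct)
{
    assert(pct <= 100);
    return dirty_pages() * 100 > total_pages * pct;
}
```

With 100 pages of memory and a 40% limit, 30 dirty pages alone stay under the
limit, while 30 dirty plus 20 pinned pages exceed it. Without the BUF_PINNED
term the second case would go unnoticed.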



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-21 Thread Stephen C. Tweedie

Hi,

On Tue, 21 Dec 1999 11:18:03 +0100 (CET), Andrea Arcangeli
<[EMAIL PROTECTED]> said:

> On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:
>> refile_buffer() checks in buffer.c.  Ideally there should be a
>> system-wide upper bound on dirty data: if each different filesystem
>> starts to throttle writes at 50% of physical memory then you only
>> need two different filesystems to overcommit your memory badly.

> If all FSes shares the dirty list of buffer.c that's not true. 

The entire point of this is that Linus has refused, point blank, to
add the complexity of journaling to the buffer cache.  The journaling
_has_ to be done independently, so we _have_ to have the dirty data
for journal transactions kept outside of the buffer cache.

We cannot use the buffer.c dirty list anyway because bdflush can write
those buffers to disk at any time.  Transactions have to control the
write ordering so we can only feed those writes into the buffer queues
under strict control when we go to commit a transaction.  

> All normal filesystems are using the mark_buffer_dirty() in buffer.c

We're not talking about normal filesystems. :)

> so currently the 40% setting of bdflush is a system-wide number and
> not a per-fs number.

For filesystems that can use that mechanism, sure.  We need to be able
to extend that mechanism so that filesystems with other writeback
mechanisms can use it too.

> If both ext3 and reiserfs are using refile_buffer and both are using
> balance_dirty in the right places as Linus wants, all seems just fine to
> me.

They aren't and they can't.

> I completely agree to change mark_buffer_dirty() to call balance_dirty()
> before returning. 

Agreed.

--Stephen