Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
"Stephen C. Tweedie" wrote: > Hi, > > On Fri, 07 Jan 2000 00:32:48 +0300, Hans Reiser <[EMAIL PROTECTED]> said: > > > Andrea Arcangeli wrote: > >> BTW, I thought Hans was talking about places that can't sleep (because of > >> some not schedule-aware lock) when he said "place that cannot call > >> balance_dirty()". > > > You were correct. I think Stephen and I are missing in communicating here. > > Fine, I was just looking at it from the VFS point of view, not the > specific filesystem. In the worst case, a filesystem can always simply > defer marking the buffer as dirty until after the locking window has > passed, so there's obviously no fundamental problem with having a > blocking mark_buffer_dirty. If we want a non-blocking version too, with > the requirement that the filesystem then to a manual rebalance once it > is safe to do so, that will work fine too. > > --Stephen Yes, but then you have to track what you defer. Code complication. I just want to leave things as they are until we have time to do SMP right. When we do SMP right, then a mark_buffer_dirty() which causes schedule is not a problem. Let's deal with this in 2.5 Hans -- Get Linux (http://www.kernel.org) plus ReiserFS (http://devlinux.org/namesys). If you sell an OS or internet appliance, buy a port of ReiserFS! If you need customizations and industrial grade support, we sell them.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
On Fri, 7 Jan 2000, Stephen C. Tweedie wrote:

>Fine, I was just looking at it from the VFS point of view, not the
>specific filesystem. In the worst case, a filesystem can always simply
>defer marking the buffer as dirty until after the locking window has
>passed, so there's obviously no fundamental problem with having a
>blocking mark_buffer_dirty. If we want a non-blocking version too, with
>the requirement that the filesystem then do a manual rebalance once it
>is safe to do so, that will work fine too.

I made the new mark_buffer_dirty blocking and __mark_buffer_dirty non-blocking while fixing the 2.3.x buffer code:

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.36pre5/buffer-2.gz

I have been running with the above applied for some days on a 2.3.36-based kernel on Alpha, and everything has worked fine so far under all kinds of load.

Andrea
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Hi,

On Fri, 07 Jan 2000 00:32:48 +0300, Hans Reiser <[EMAIL PROTECTED]> said:

> Andrea Arcangeli wrote:
>> BTW, I thought Hans was talking about places that can't sleep (because of
>> some not schedule-aware lock) when he said "place that cannot call
>> balance_dirty()".

> You were correct. I think Stephen and I are miscommunicating here.

Fine, I was just looking at it from the VFS point of view, not the specific filesystem. In the worst case, a filesystem can always simply defer marking the buffer as dirty until after the locking window has passed, so there's obviously no fundamental problem with having a blocking mark_buffer_dirty. If we want a non-blocking version too, with the requirement that the filesystem then do a manual rebalance once it is safe to do so, that will work fine too.

--Stephen
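The two variants discussed above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the class BufferCache, the dirty_limit parameter, and the helper update_under_lock are hypothetical names invented for illustration, and mark_buffer_dirty_nonblocking merely stands in for the non-blocking variant mentioned in the thread. The point it tries to show is the split between marking under a lock and rebalancing once the locking window has passed.

```python
import threading


class BufferCache:
    """Toy model: tracks dirty buffers and 'flushes' when over a limit."""

    def __init__(self, dirty_limit):
        self.dirty_limit = dirty_limit  # hypothetical threshold
        self.dirty = set()
        self.flushes = 0

    def balance_dirty(self):
        # Stands in for the call that may sleep while writing buffers out.
        if len(self.dirty) > self.dirty_limit:
            self.dirty.clear()
            self.flushes += 1

    def mark_buffer_dirty_nonblocking(self, buf):
        # Never sleeps; the caller owes a balance_dirty() call later.
        self.dirty.add(buf)

    def mark_buffer_dirty(self, buf):
        # Blocking variant: mark, then rebalance immediately.
        self.mark_buffer_dirty_nonblocking(buf)
        self.balance_dirty()


def update_under_lock(cache, lock, bufs):
    """Mark buffers inside the lock, rebalance only after dropping it."""
    with lock:
        for buf in bufs:
            cache.mark_buffer_dirty_nonblocking(buf)
    # Safe point: no lock is held, so a call that may sleep is acceptable.
    cache.balance_dirty()


cache = BufferCache(dirty_limit=2)
update_under_lock(cache, threading.Lock(), ["b1", "b2", "b3"])
```

The cost Hans points out is visible even here: the caller has to remember that a rebalance is still owed after the lock is released.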
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my
Hi,

On Thu, 6 Jan 2000 20:25:38 -0500 (EST), "Albert D. Cahalan" <[EMAIL PROTECTED]> said:

> AIX has such an API already. It is good to clone if you can.

The AIX API is much more than a simple small-operation atomic transaction API, isn't it? The filesystem transactions have many properties --- no abort, predictable size, short duration --- which make a journaling engine inappropriate for use in a general purpose user-visible transaction API.

--Stephen
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my
Hans Reiser writes:

> Yes, but not before 2.5. Chris and I have already discussed that
> it would be nice to make the transaction API available to user space,
> but we haven't done any work on it, or even specified the user API.

AIX has such an API already. It is good to clone if you can. This ought to contain the API, but might require some digging:

http://www.rs6000.ibm.com/support/
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Andrea Arcangeli wrote:

> BTW, I thought Hans was talking about places that can't sleep (because of
> some not schedule-aware lock) when he said "place that cannot call
> balance_dirty()".

You were correct. I think Stephen and I are miscommunicating here.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
BTW, I thought Hans was talking about places that can't sleep (because of some not schedule-aware lock) when he said "place that cannot call balance_dirty()".

On Thu, 6 Jan 2000, Stephen C. Tweedie wrote:

>It shouldn't be impossible: as long as we are protected against
>recursive invocations of balance_dirty (which should be easy to

I am not sure I understand correctly. In case the ll_rw_block layer produces dirty buffers, we are protected by wakeup_bdflush, which becomes a no-op when called again from kflushd (wakeup_bdflush is not blocking, to avoid bdflush waiting on bdflush :). And in general balance_dirty should never recurse on the same stack.

>arrange) we should be safe enough, at least if the memory reservation
>bits of the VM/fs interaction are working so that the balance_dirty
>can guarantee to run to completion.

Hmm, maybe you are talking about something else...

Andrea
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Hi,

On Thu, 23 Dec 1999 06:41:44 +0800, Tan Pong Heng <[EMAIL PROTECTED]> said:

> I was thinking that, unless you want to have FS specific buffer/page
> cache, there is always a gain for a unified cache for all fs. I think
> the one piece of functionality missing from the 2.3 implementation
> is the dependency between the various pages. If you could specify a
> tree relation between the various subsets of the buffer/page and the
> reclaim mechanism honored that, everything should be fine. For FS that
> does not care about ordering, they could simply ignore this
> capability and the mechanism could assume that everything is in one
> big set and could be reclaimed in any order.

That just doesn't give you enough power. The trouble is that there are IO dependencies which you don't know about until after the first IO has completed. For example, in journaling you may be allocating journal blocks on demand, and you don't know where the journal commit block will be until you have written most of the rest of the transaction out.

If you are doing deferred allocation of disk blocks, then you can't even _start_ the dependent IO trail until you explicitly tell the filesystem that the flush-to-disk is beginning.

You need a way to let the filesystem know that you want something in the cache to be written to disk. You don't want to presume that one general-purpose ordering mechanism will work.

--Stephen
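A rough way to picture the alternative Stephen describes is a cache that asks the owning filesystem to write things out, instead of imposing an ordering of its own. The sketch below is a toy Python model under that assumption; JournalingFS, Cache, register, reclaim, and the block numbers are all hypothetical names and values, not any kernel interface. It shows why the commit block's location cannot be declared to the cache up front when journal blocks are allocated on demand.

```python
class JournalingFS:
    """Toy filesystem that decides its own write order when asked to flush."""

    def __init__(self):
        self.disk = []                  # ordered record of (location, block)
        self.pending = []               # blocks in the running transaction
        self.next_journal_block = 100   # hypothetical start of the journal

    def modify(self, block):
        self.pending.append(block)

    def flush(self):
        # Journal blocks are allocated on demand, so the commit block's
        # location is only known once the rest has been written.
        for block in self.pending:
            self.disk.append((self.next_journal_block, block))
            self.next_journal_block += 1
        commit_location = self.next_journal_block
        self.disk.append((commit_location, "COMMIT"))
        self.next_journal_block += 1
        self.pending.clear()
        return commit_location


class Cache:
    """Toy cache that delegates write-out to the owning filesystem."""

    def __init__(self):
        self.owners = []

    def register(self, fs):
        self.owners.append(fs)

    def reclaim(self):
        # The cache only says "please write this out"; ordering is the
        # filesystem's business.
        return [fs.flush() for fs in self.owners]


fs = JournalingFS()
cache = Cache()
cache.register(fs)
fs.modify("inode 7")
fs.modify("bitmap 3")
commit_locations = cache.reclaim()
```

A static dependency tree handed to the cache in advance would have had to name location 102 before either of the first two writes had happened.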
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Hi,

On Thu, 23 Dec 1999 02:37:48 +0300, Hans Reiser <[EMAIL PROTECTED]> said:

>> > I completely agree to change mark_buffer_dirty() to call balance_dirty()
>> > before returning.

> How can we use a mark_buffer_dirty that calls balance_dirty in a
> place where we cannot call balance_dirty?

It shouldn't be impossible: as long as we are protected against recursive invocations of balance_dirty (which should be easy to arrange) we should be safe enough, at least if the memory reservation bits of the VM/fs interaction are working so that the balance_dirty can guarantee to run to completion.

--Stephen
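Protection against recursive invocation can be pictured as a simple re-entrancy flag. The following Python sketch is an illustration only: the Balancer class, its counters, and the writeout callback are hypothetical, and a real kernel would need per-task state and proper locking rather than a single flag.

```python
class Balancer:
    """Toy re-entrancy guard around a balance_dirty-style routine."""

    def __init__(self):
        self.in_balance = False
        self.runs = 0
        self.skipped = 0

    def balance_dirty(self, writeout):
        if self.in_balance:
            # Already balancing further up the call chain: do nothing.
            self.skipped += 1
            return
        self.in_balance = True
        try:
            self.runs += 1
            writeout()
        finally:
            self.in_balance = False


balancer = Balancer()


def writeout_that_dirties_more():
    # Writing out may dirty further buffers and re-enter balance_dirty().
    balancer.balance_dirty(writeout_that_dirties_more)


balancer.balance_dirty(writeout_that_dirties_more)
```

The nested call returns immediately, so the outer invocation can run to completion without recursing on the same stack.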
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
Tigran Aivazian wrote:

> On Wed, 5 Jan 2000, Peter J. Braam wrote:
> > I think I mean joining. What I need is:
> >
> > braam starts trans
> >    does A
> >    calls reiser: hans starts
> >       does B
> >       hans commits; nothing goes to disk yet
> >    braam does C
> > braam commits/aborts ABC now go or don't
>
> no, that definitely looks like nesting to me.
>
> Tigran.

It looks like joining to me. If it was nesting, you would be able to commit A without committing B.

Of course, if there is database literature defining nesting, and there probably is, then I should be ignored here. Perhaps the literature defines nesting as equivalent to what I call joining.

Hans
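The behaviour Peter asks for ("joining" in Hans's terms) can be modelled with a depth counter: an inner start joins the running transaction, an inner commit only decrements the counter, and nothing reaches disk until the outermost commit. This Python sketch is a toy model with hypothetical names (Journal, record, on_disk); it is not the API of either journaling implementation discussed in the thread, and the abort method is an assumption about what a whole-transaction abort would mean.

```python
class Journal:
    """Toy journal in which inner transactions join the outer one."""

    def __init__(self):
        self.depth = 0
        self.ops = []       # operations of the running joined transaction
        self.on_disk = []   # operations that have been committed

    def start(self):
        self.depth += 1

    def record(self, op):
        assert self.depth > 0, "no transaction running"
        self.ops.append(op)

    def commit(self):
        self.depth -= 1
        if self.depth == 0:
            # Only the outermost commit makes anything durable.
            self.on_disk.extend(self.ops)
            self.ops.clear()

    def abort(self):
        # Assumed semantics: aborting discards the whole joined transaction.
        self.depth = 0
        self.ops.clear()


journal = Journal()
journal.start()                 # braam starts trans
journal.record("A")
journal.start()                 # hans starts (joins braam's transaction)
journal.record("B")
journal.commit()                # hans commits; nothing goes to disk yet
after_inner_commit = list(journal.on_disk)
journal.record("C")
journal.commit()                # braam commits: A, B and C go together
```

With true nesting in Hans's sense, the inner commit would have been able to make B durable on its own; here it cannot.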
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
Yes, but not before 2.5. Chris and I have already discussed that it would be nice to make the transaction API available to user space, but we haven't done any work on it, or even specified the user API. We probably won't even start work on it for 6 months (unless a sponsor asks for it). We do think it is a good idea.

Hans

"Peter J. Braam" wrote:

> I think I mean joining. What I need is:
>
> braam starts trans
>    does A
>    calls reiser: hans starts
>       does B
>       hans commits; nothing goes to disk yet
>    braam does C
> braam commits/aborts ABC now go or don't
>
> - Peter -
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
On Wed, 5 Jan 2000, Peter J. Braam wrote:

> I think I mean joining. What I need is:
>
> braam starts trans
>    does A
>    calls reiser: hans starts
>       does B
>       hans commits; nothing goes to disk yet
>    braam does C
> braam commits/aborts ABC now go or don't

Reiserfs won't do this kind of nesting right now; we also don't have a transaction abort (aside from crashing the machine). These can be added to a future version, but would you mind explaining your transaction needs in more detail (offline) so I can get a better idea of what you are looking for?

-chris
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
On Wed, 5 Jan 2000, Peter J. Braam wrote:

> I think I mean joining. What I need is:
>
> braam starts trans
>    does A
>    calls reiser: hans starts
>       does B
>       hans commits; nothing goes to disk yet
>    braam does C
> braam commits/aborts ABC now go or don't

no, that definitely looks like nesting to me.

Tigran.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
I think I mean joining. What I need is:

braam starts trans
   does A
   calls reiser: hans starts
      does B
      hans commits; nothing goes to disk yet
   braam does C
braam commits/aborts ABC now go or don't

- Peter -

On Wed, 5 Jan 2000, Hans Reiser wrote:

> Is nesting really the term you mean to use here, or is joining the term you
> mean?
>
> Do you really mean transactions within other transactions?
>
> Exactly what functionality do you need?
>
> Hans
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending old email lost by (former) ISP)
Erez Zadok wrote:

> Hans and linux-fsdevel folks: I have a proposal. How would you all feel
> forming an informal group that would report changes relevant to f/s
> developers on this list. (Maybe even a different mailing list?)

I think that sending emails to this mailing list summarizing changes to the kernel that other FS developers need to know about is a reasonable idea.

Of course, I also like comments in the code explaining the concepts each function embodies, so I know that I am on the fringe.

Hans
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
Is nesting really the term you mean to use here, or is joining the term you mean?

Do you really mean transactions within other transactions?

Exactly what functionality do you need?

Hans

"Peter J. Braam" wrote:

> Hi,
>
> I have one request for the journal API for use by network file systems -
> it is a request of a slightly different nature than the ones discussed so
> far.
>
> InterMezzo (www.inter-mezzo.org) exploits an existing disk file system as
> a cache and wraps around it. (Any disk file system can be used, but so far
> only Ext2 has been exploited.) High availability file systems need update
> logs of changes that were made to the cache so that these may be
> propagated to peers when they come back online (to support "disconnected
> operation").
>
> Requested feature:
>
> Stephen's journal API has a tremendously useful feature: it allows nesting
> of transactions. I don't know if Reiser has this (can you tell me
> Chris?) but it is _incredibly_ useful. So:
>
> - InterMezzo can start a journal transaction
> - execute the underlying Ext3 routine within that transaction
>   (i.e. the Ext3 transaction becomes part of the one started
>   by InterMezzo)
> - InterMezzo finishes its routine (e.g. by noting that an update
>   took place in its update log) and commits or aborts the transaction
>
> [So, in particular InterMezzo and Ext3 share the journal transaction log.]
>
> Why is this useful? There are at least two reasons:
>
> - the InterMezzo update log can be kept in sync with the Ext3 file
>   system as a cache
>
> - InterMezzo will soon manage somewhat more metadata (e.g. it may want to
>   remember a global file identifier, similar to a Coda FID or NFS file
>   handle) and it can make updates to its metadata atomically with updates
>   made to Ext3 metadata.
>
> Both of these reasons touch the core architectural decisions of systems
> like Coda/AFS/InterMezzo/DCE-DFS -- so there is some historical reason to
> be so delighted with what one can do with Stephen's API.
>
> Presently, systems like Coda and AFS have a hell of a time keeping caches
> in sync with the metadata and to a large extent Coda's really bad
> performance is caused by this (an external transaction system is used in
> conjunction with synchronous operations on the disk file system, ouch...).
> InterMezzo will start using the kernel journal facility that should be
> much lighter weight.
>
> Is this a reasonable thing to ask for?
>
> - Peter -
Re: kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)
Hi guys,

Although I received a few nice replies privately, I think it makes sense to clarify what I meant when I said "Jeff's comments irritate me". I meant that we should put a lot more work into writing kernel commentaries like the excellent one already started by Neil Brown on VFS and nfsd (I don't have the URL handy).

I would love to see something the size of the Encyclopaedia Britannica that documents every single line of kernel code, including drivers, and is always up to date despite those lines changing every second. Titanic job, isn't it? But accepting titanic jobs is more honourable (at least in my eyes) than saying "grepping patches ain't difficult".

I agree with Jeff that notifying individuals is impossible, but helping to disseminate the knowledge by means of kernel internals docs is the way to go.

Regards,
Tigran.

On Sun, 26 Dec 1999, Erez Zadok wrote:

> In message <[EMAIL PROTECTED]>, Jeff Garzik writes:
>
> > [...]
> > To sum, documenting changes is a very good idea, notifying specific
> > hackers of specific kernel changes is a waste of time [unless they
> > are the maintainers of the code being changed, of course].
>
> I agree that notifying individuals doesn't scale. Notifying the list as a
> whole, does.
>
> > Jeff
>
> Erez.
Re: kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)
In message <[EMAIL PROTECTED]>, Jeff Garzik writes:

> [...]
> To sum, documenting changes is a very good idea, notifying specific
> hackers of specific kernel changes is a waste of time [unless they
> are the maintainers of the code being changed, of course].

I agree that notifying individuals doesn't scale. Notifying the list as a whole, does.

> Jeff

Erez.
kernel change logs (was Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?)
(copied to linux-kernel)

On Sun, 26 Dec 1999, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Jeff Garzik writes:
> > On Thu, 23 Dec 1999, Hans Reiser wrote:
> > > All I'm going to ask is that if mark_buffer_dirty gets changed again,
> > > whoever changes it please let us know this time. The last two times
> > > it was changed we weren't informed, and the first time it happened it
> > > took a long time to figure it out.
> > Can't you figure this sort of thing out on your own? Generally if you
> > want to stay updated on something, you are the one who needs to do the
> > legwork. And grep'ing patches ain't that hard

> Jeff, Hans is absolutely right.

So, you are accepting the job of notifying Hans each time mark_buffer_dirty changes? ;-)

Hans is not right, because the request does not scale. I would love to be notified whenever drivers/video changes, for example, but I'm sure Geert and Linus have better things to do with their time.

A small Perl script using ctags and grep (or other means) can get you a list of functions changed in each release.

> In my case (stackable f/s), every time there's a change to
> anything under linux/fs, linux/mm, or headers, I've got to find out what
> changed and how it affected my code.

Any change to the kernel core requires analysis and testing in order to determine the effects on other code. E-mail notification doesn't change that.

> There is no ChangeLog[...]
> Hans and linux-fsdevel folks: I have a proposal. How would you all feel
> forming an informal group that would report changes relevant to f/s
> developers on this list. (Maybe even a different mailing list?) I'm [...]
> Comments?

In the past, I have publicly and privately argued for maintained ChangeLogs in the kernel. There are so many advantages, especially when hacking up old unmaintained code. There have been several cases of hackers duplicating old (but buggy) submissions, which could have been avoided had they read a well-maintained ChangeLog. Those who ignore (or are ignorant of, in this case) history are doomed to repeat it. :)

Linux is getting enough attention and eyes that I think ChangeLogs would be of immense value. Many people read and learn from the kernel code -- and even more knowledge can be gleaned from reading ChangeLogs sometimes.

But none of the people who write most of the code indicated any interest. The gcc project requires a ChangeLog entry with each submission, something I would _love_ to see. But that requires Linus's intervention. And that requires convincing Linus, Alan, DaveM, Al Viro, and other submitters of large patches to agree to write ChangeLog entries.

Without such a requirement, partially maintained ChangeLogs have even less advantage over no ChangeLogs at all -- in this case, no docs would be better than wrong docs IMNSHO.

As to your suggestion, a group of people posting VFS changelogs -- more power to you! It's better than nothing. Just make sure such ChangeLogs are actively maintained, if they ever make it off the mailing list and into the kernel sources.

To sum, documenting changes is a very good idea, notifying specific hackers of specific kernel changes is a waste of time [unless they are the maintainers of the code being changed, of course].

Jeff
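Jeff's "list of functions changed in each release" can be approximated in more than one way. The sketch below does not use ctags or Perl as he suggests; it swaps in a different technique, namely reading the function context that unified diffs carry after the second "@@" of a hunk header when they are generated with an option such as diff -p. The helper name changed_functions and the sample patch (file name, function names, hunk numbers) are invented purely for illustration.

```python
import re

# Hunk header of a unified diff, optionally followed by function context.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@ ?(.*)$")
# First identifier that is directly followed by an opening parenthesis.
FUNC_RE = re.compile(r"([A-Za-z_]\w*)\s*\(")


def changed_functions(patch_text):
    """Return function names seen in hunk headers, in order, de-duplicated."""
    names = []
    for line in patch_text.splitlines():
        hunk = HUNK_RE.match(line)
        if not hunk:
            continue
        func = FUNC_RE.search(hunk.group(1))
        if func and func.group(1) not in names:
            names.append(func.group(1))
    return names


# Invented sample input; not taken from any real kernel patch.
SAMPLE_PATCH = """\
--- a/fs/example.c
+++ b/fs/example.c
@@ -10,6 +10,7 @@ void example_mark_dirty(struct buf *b)
 context line
+added line
@@ -40,3 +41,4 @@ static int example_balance(void)
 context line
+added line
@@ -90,2 +92,3 @@
+added line
"""

functions = changed_functions(SAMPLE_PATCH)
```

This only tells you where a hunk landed, not what the change means, which is the limitation Erez raises about reading patches.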
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
May I ask why the time is O(N*Log(N)) instead of O(Log(N))?

We have this interesting OS class implementing an AVL-tree-structured directory entry in the ext2 directory file on disk. I have always thought it is not going to work out. But the TA and the professor keep telling me the new file system will be better than ext2 because now we have O(Log(N)) time for search (ok) and insert/removal (???). I really doubt it, but I do not know where they could be wrong.

Besides, how can one join this [EMAIL PROTECTED] email list? I did not find any instructions on how to do it.

Fei

On Thu, 23 Dec 1999, Andrea Arcangeli wrote:

> On Wed, 22 Dec 1999, William J. Earl wrote:
>
> >in the extent. If the page cache were indexed by a per-inode AVL tree
>
> Some months ago I did some research in putting the pagecache into a
> per-inode RB-tree. AVL would be overkill because insert/removal can be the
> only operation done on the tree (with cache pollution going on).
>
> Unfortunately if the inode size gets very large the RB-tree won't scale
> :(. With a hash you can say "ok, enlarge the hash 200mbyte and get rid of
> the complexity paying with memory", while with an rbtree you have to
> always pay O(N*log(N)) for each query/insert/removal... Chuck's bench
> generated nice numbers with the pagecache in the per-inode RB though
> (without considering your "ordering" needs of course).
>
> The interesting code should be here (or nearby, just search for the
> filename in the ftp area if it's not exactly there):
>
> ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.6_andrea5.bz2
>
> Andrea
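On the complexity question, one reading (which the thread itself does not confirm) is that a single lookup in a balanced tree of N keys costs on the order of log(N) steps, and that N*log(N) is what N such operations cost in total. The Python sketch below counts probes for a binary search over a sorted list, used here as a stand-in for descending a balanced tree of the same size; lookup_steps and the choice of N = 1024 are illustrative assumptions.

```python
import math


def lookup_steps(sorted_keys, key):
    """Binary search that returns how many probes it needed."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return steps


N = 1024
keys = list(range(N))

# Worst case for one lookup grows like log2(N).
per_lookup = max(lookup_steps(keys, k) for k in keys)
# Looking up every key once grows like N * log2(N).
total = sum(lookup_steps(keys, k) for k in keys)
```

A hash table with enough buckets avoids the log(N) factor per operation at the price of memory, which seems to be the trade-off Andrea is pointing at.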
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
In message <[EMAIL PROTECTED]>, Jeff Garzik writes:
> On Thu, 23 Dec 1999, Hans Reiser wrote:
> > All I'm going to ask is that if mark_buffer_dirty gets changed again,
> > whoever changes it please let us know this time. The last two times
> > it was changed we weren't informed, and the first time it happened it
> > took a long time to figure it out.
>
> Can't you figure this sort of thing out on your own? Generally if you
> want to stay updated on something, you are the one who needs to do the
> legwork. And grep'ing patches ain't that hard
>
> Jeff

Jeff, Hans is absolutely right. We can all figure it out on our own, and waste many hours re-discovering that which others have discovered independently. It's a royal pain and time sink. I'd rather write new code than try to figure out what's changed b/t kernel versions.

In my case (stackable f/s), every time there's a change to anything under linux/fs, linux/mm, or headers, I've got to find out what changed and how it affected my code. It's NOT enough to grep the patches. Unified diffs don't give you enough context to understand the overall changes that were made. I have to use emacs's ediff or other methods to find out the meaning and motivation behind the change.

There is no NEWS file for each release. There is no ChangeLog for each release. Actually there are a few ChangeLog files sprinkled around the sources. The last time linux-2.3.25/fs/ChangeLog was updated was 1998.

There is no one who summarizes kernel changes. A long time ago, someone used to. I don't remember his name. Is he still doing that?

I maintain a much smaller package (am-utils) and there's no way I could remember what changes I've made throughout the years. That's why I keep detailed ChangeLog and NEWS files w/ my releases. I realize the linux kernel is a much bigger and more complex beast, but shouldn't that be a bigger motivation for everyone to keep ChangeLogs? IMHO, if we want to speed linux development along, we should help the documentation of linux.

Hans and linux-fsdevel folks: I have a proposal. How would you all feel forming an informal group that would report changes relevant to f/s developers on this list. (Maybe even a different mailing list?) I'm willing to take the time to report whatever VFS changes I find each time I update my stackable f/s code for a new kernel, including when no relevant changes are made (which IMHO is just as important).

This effort would help all of us f/s developers, but only if we each take the time to report our findings to this list. The few minutes each person takes to report their findings as they relate to their f/s will save numerous other people many hours; overall this would help everyone. We can also make it easy to find these messages in the archives, so we can make the Subject of such messages a grep-able format---say,

	CHANGE 2.3.17-2.3.18: vm_area_struct->vm_pte renamed vm_private_data

Comments?

Erez.
[OT] Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
On Thu, 23 Dec 1999, Jeff Garzik wrote:
> Can't you figure this sort of thing out on your own?
..
> And grep'ing patches ain't that hard

Jeff, with all respect to your great kernel hacking talents, this sort of comment really irritates me. Most (I assume) kernel hackers have full-time jobs which have nothing to do with Linux; only a chosen few are lucky enough to work full-time on Linux kernel hacking. I spend 0 minutes a day on Linux and still attempt to contribute something useful (see my patches site, most of which are obsoleted by acceptance). Grepping patches is not hard unless one does it after one comes back home tired in the evening...

So, the moral of the story is: Hans is right.

Regards,
--
Tigran A. Aivazian | http://www.sco.com
Escalations Research Group | tel: +44-(0)1923-813796
Santa Cruz Operation Ltd | http://www.ocston.org/~tigran
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
On Thu, 23 Dec 1999, Hans Reiser wrote:
> All I'm going to ask is that if mark_buffer_dirty gets changed again, whoever
> changes it please let us know this time. The last two times it was changed
> we weren't informed, and the first time it happened it took a long time to
> figure it out.

Can't you figure this sort of thing out on your own? Generally if you want to stay updated on something, you are the one who needs to do the legwork. And grep'ing patches ain't that hard.

Jeff
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
All I'm going to ask is that if mark_buffer_dirty gets changed again, whoever changes it please let us know this time. The last two times it was changed we weren't informed, and the first time it happened it took a long time to figure it out. I think that whether we make __mark_buffer_dirty or mark_buffer_dirty schedule free is an argument over whether to name a function half-full or half-empty. I yield to both sides. Hans Andrea Arcangeli wrote: > On Thu, 23 Dec 1999, Hans Reiser wrote: > > >If reiserfs had good SMP, you could stall it anywhere, and the code > >could handle that. But we don't, and I bet others also don't, and we > >won't have it for some time even though we are working on it. > > I completly understand that we need also an atomic mark_buffer_dirty and > to call buffer_dirty from some other place. > > But IMHO there's no one good reason to break all the old rock solid > filesystems like ext2 just because there's the need of a new feature. > > I am not proposing to not provide a way to atomically marking a buffer > dirty. I propose only to not change the semantic of the function called > `mark_buffer_dirty()' as it happened now. > > If you want the atomic version just recall __mark_buffer_dirty() and use > balance_dirty() by hand as soon as you can (after releasing your SMP > locks). > > We can trivially replace mark_buffer_dirty() with __mark_buffer_dirty() > with an automated script inside smart/SMP filesystems that wants to > continue to use the current 2.3.x semantic of mark_buffer_dirty(). > > Andrea -- Get Linux (http://www.kernel.org) plus ReiserFS (http://devlinux.org/namesys). If you sell an OS or internet appliance, buy a port of ReiserFS! If you need customizations and industrial grade support, we sell them.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
On Thu, 23 Dec 1999, Hans Reiser wrote:

>If reiserfs had good SMP, you could stall it anywhere, and the code
>could handle that. But we don't, and I bet others also don't, and we
>won't have it for some time even though we are working on it.

I completely understand that we also need an atomic mark_buffer_dirty and to call buffer_dirty from some other place.

But IMHO there's no good reason to break all the old rock-solid filesystems like ext2 just because there's a need for a new feature.

I am not proposing that we not provide a way to atomically mark a buffer dirty. I propose only that we not change the semantics of the function called `mark_buffer_dirty()' as has happened now.

If you want the atomic version, just call __mark_buffer_dirty() and use balance_dirty() by hand as soon as you can (after releasing your SMP locks).

We can trivially replace mark_buffer_dirty() with __mark_buffer_dirty() with an automated script inside smart/SMP filesystems that want to continue to use the current 2.3.x semantics of mark_buffer_dirty().

Andrea
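The pattern described above (mark the buffer dirty without sleeping while a lock is held, then rebalance by hand once the lock is dropped) can be sketched as a toy userspace model. Everything here is hypothetical: `model_buffer`, `model_mark_buffer_dirty_nonblocking()`, `model_balance_dirty()` and the limit of 2 are names and values invented for illustration, not the kernel's `__mark_buffer_dirty()` or `balance_dirty()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are real kernel symbols. */
struct model_buffer {
    bool dirty;
};

static size_t model_nr_dirty;               /* buffers currently dirty      */
static size_t model_nr_flushes;             /* times the caller "blocked"   */
static const size_t MODEL_DIRTY_LIMIT = 2;  /* assumed throttling threshold */

/* Non-blocking half: only flips the flag and bumps the counter, so it is
 * safe to call while holding a lock that must not sleep. */
static void model_mark_buffer_dirty_nonblocking(struct model_buffer *bh)
{
    if (!bh->dirty) {
        bh->dirty = true;
        model_nr_dirty++;
    }
}

/* Blocking half: stands in for the rebalance step. If too many buffers
 * are dirty, pretend to write them all out (this is where a real
 * implementation might sleep). */
static void model_balance_dirty(struct model_buffer *bufs, size_t n)
{
    if (model_nr_dirty <= MODEL_DIRTY_LIMIT)
        return;
    for (size_t i = 0; i < n; i++) {
        if (bufs[i].dirty) {
            bufs[i].dirty = false;
            model_nr_dirty--;
        }
    }
    model_nr_flushes++;
}

/* Caller pattern: dirty everything inside the critical section, then
 * rebalance only after the lock has been released. */
static void model_update_under_lock(struct model_buffer *bufs, size_t n)
{
    /* lock();   imagine a non-sleeping lock taken here */
    for (size_t i = 0; i < n; i++)
        model_mark_buffer_dirty_nonblocking(&bufs[i]);
    /* unlock(); sleeping is allowed again from here on */
    model_balance_dirty(bufs, n);
}
```

The cost Hans points out elsewhere in the thread is visible even in this toy: the caller has to remember that a rebalance is still owed once the lock is gone.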
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
On Wed, 22 Dec 1999, William J. Earl wrote:

>in the extent. If the page cache were indexed by a per-inode AVL tree

Some months ago I did some research in putting the pagecache into a per-inode RB-tree. AVL would be overkill because insert/removal can be the only operation done on the tree (with cache pollution going on).

Unfortunately if the inode size gets very large the RB-tree won't scale :(. With a hash you can say "ok, enlarge the hash 200mbyte and get rid of the complexity paying with memory", while with an rbtree you have to always pay O(N*log(N)) for each query/insert/removal... Chuck's bench generated nice numbers with the pagecache in the per-inode RB though (without considering your "ordering" needs of course).

The interesting code should be here (or nearby, just search for the filename in the ftp area if it's not exactly there):

ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.6_andrea5.bz2

Andrea
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
"Benjamin C.R. LaHaise" wrote: > I completly agree to change mark_buffer_dirty() to call balance_dirty() > > before returning. But if you add the balance_dirty() calls all over the > > right places all should be _just_ fine as far I can tell. > > I don't agree, both for the reasons above and because doing a > balance_dirty in mark_buffer_dirty tends to result in stalls in the > *wrong* place, because it tends to stall in the middle of an operation, > not before it has begun. You end up stalling on metadata operations that > shouldn't stall. The stall thresholds for data vs metadata have to be > different in order to make the system 'feel' right. This is easily > accomplished by trying to "allocate" the dirty buffers before you actually > dirty them (by checking if there's enough slack in the dirty buffer > margins before doing the operation). > > -ben If reiserfs had good SMP, you could stall it anywhere, and the code could handle that. But we don't, and I bet others also don't, and we won't have it for some time even though we are working on it. Hans -- Get Linux (http://www.kernel.org) plus ReiserFS (http://devlinux.org/namesys). If you sell an OS or internet appliance, buy a port of ReiserFS! If you need customizations and industrial grade support, we sell them.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Stephen's remarks seem right to me. Hans -- Get Linux (http://www.kernel.org) plus ReiserFS (http://devlinux.org/namesys). If you sell an OS or internet appliance, buy a port of ReiserFS! If you need customizations and industrial grade support, we sell them.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Tan Pong Heng writes:
...
> I was thinking that, unless you want to have FS specific buffer/page cache,
> there is always a gain for a unified cache for all fs. I think the one piece
> of functionality missing from the 2.3 implementation is the dependency
> between the various pages. If you could specify a tree of relations between
> the various subsets of the buffer/page and the reclaim mechanism honored
> that, everything should be fine. For FS that does not care about ordering,
> they could simply ignore this capability and the mechanism could assume
> that everything is in one big set and could be reclaimed in any order.
...

For the XFS port, we have been working on this, since XFS very much wants to cluster logically adjacent delayed-allocation (and delayed-write) pages together to optimize writes. That is, if someone who wants to write back a dirty page to disk asks the file system to do so, then the file system wants to find all nearby pages (nearby in the file, not necessarily in memory). The file system looks up the extent in which the page resides, or allocates an extent if the page is part of a delayed allocation, and then writes all of the pages in the extent at once.

Given the present data structures, this is done by probing the page cache for each page in the extent. If the page cache were indexed by a per-inode AVL tree (or other ordered index), then collecting adjacent pages would be cheaper. Compared to a disk I/O, hash table probes are still relatively low in cost, but it would be possible to do a bit better with some ordered index.
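To illustrate why an ordered index helps with clustering, here is a minimal sketch in which a sorted array of cached page offsets stands in for the per-inode tree. One binary search finds the first cached page at or after the extent start, and the neighbours are then read off in order instead of probing once per page. The function names and the array representation are assumptions made for this example and are not XFS or page cache code.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first element in the sorted array idx[0..n) that is >= key. */
static size_t model_lower_bound(const unsigned long *idx, size_t n,
                                unsigned long key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Collect the cached page offsets that fall inside the extent
 * [start, start + len) into out[], returning how many were found.
 * idx[] is the sorted list of page offsets currently cached for one file. */
static size_t model_collect_extent(const unsigned long *idx, size_t n,
                                   unsigned long start, unsigned long len,
                                   unsigned long *out, size_t out_cap)
{
    size_t found = 0;
    /* Every element from the lower bound onward is >= start, so the
     * subtraction below cannot wrap around. */
    for (size_t i = model_lower_bound(idx, n, start);
         i < n && idx[i] - start < len && found < out_cap; i++)
        out[found++] = idx[i];
    return found;
}
```

With a hash-only cache the same job takes one probe per page in the extent, whether or not the page is present.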
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
"Stephen C. Tweedie" wrote: > Hi, > > On Tue, 21 Dec 1999 11:18:03 +0100 (CET), Andrea Arcangeli > <[EMAIL PROTECTED]> said: > > > On Tue, 21 Dec 1999, Stephen C. Tweedie wrote: > >> refile_buffer() checks in buffer.c. Ideally there should be a > >> system-wide upper bound on dirty data: if each different filesystem > >> starts to throttle writes at 50% of physical memory then you only > >> need two different filesystems to overcommit your memory badly. > > > If all FSes shares the dirty list of buffer.c that's not true. Stephen's global counter really would make things simpler to code. I would also like to see each filesystem able to specify a minimum amount it wants reserved as clean pages, and have a global minimum that is the sum of all of these amounts for all mounted filesystems. > > > The entire point of this is that Linus has refused, point blank, to > add the complexity of journaling to the buffer cache. The journaling > _has_ to be done independently, so we _have_ to have the dirty data > for journal transactions kept outside of the buffer cache. > > We cannot use the buffer.c dirty list anyway because bdflush can write > those buffers to disk at any time. Transactions have to control the > write ordering so we can only feed those writes into the buffer queues > under strict control when we go to commit a transaction. > > > All normal filesystems are using the mark_buffer_dirty() in buffer.c > > We're not talking about normal filesystems. :) > > > so currently the 40% setting of bdflush is a system-wide number and > > not a per-fs number. > > For filesystems that can use that mechanism, sure. We need to be able > to extend that mechanism so that filesystems with other writeback > mechanisms can use it too. > > > If both ext3 and reiserfs are using refile_buffer and both are using > > balance_dirty in the right places as Linus wants, all seems just fine to > > me. > > They aren't and they can't. 
> > > I completly agree to change mark_buffer_dirty() to call balance_dirty() > > before returning. > > Agreed. How can we use a mark_buffer_dirty that calls balance_dirty in a place where we cannot call balance_dirty? > > > --Stephen -- Get Linux (http://www.kernel.org) plus ReiserFS (http://devlinux.org/namesys). If you sell an OS or internet appliance, buy a port of ReiserFS! If you need customizations and industrial grade support, we sell them.
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
"Stephen C. Tweedie" wrote: > Hi, > > On Tue, 21 Dec 1999 20:21:05 -0500 (EST), "Benjamin C.R. LaHaise" > <[EMAIL PROTECTED]> said: > > > The buffer dirty lists are the wrong place to be dealing with this. We > > need a lightweight, fast way of monitoring the system's dirty buffer/page > > thresholds -- one that can be called for every write to a page or on the > > write faults for cow pages. > > Precisely. The only thing that the core VM needs to export is an atomic > counter for such pages, a wait queue so that processes can wait for > pages to be cleaned, and a function to be called to try to reclaim such > pages. > > Remember, though, that we have three different types of page we need to > deal with. There are simple used pages, which we need to reclaim in a > component-independent manner when we are using too much memory; then > there are dirty pages which can be flushed to disk at any time; then > there are reserved pages which cannot be flushed to disk without some > extra work. > > The first case is simple: we already have the wait queues and reclaim > functions in place, and all we need is an address_space callback to > allow filesystem-specific caches to return pages when shrink_mmap() > wants them. > > In the second case (dirty pages), bdflush already does some of the work, > but we need a more generic solution of we want to support dirty data > which is not stored in buffer_heads in a portable manner. > > The third case (reserved pages) is the case which doesn't affect any > current code but which will become really important for journaled or > deferred-allocation filesystems. > > --Stephen Sorry for intruding, I have been monitoring this thread with interest. I was thinking that, unless you want to have FS specific buffer/page cache, there is alway a gain for a unified cache for all fs. I think the one piece of functionality missing from the 2.3 implementation is the dependency between the various pages. 
If you could specify a tree of relations between the various subsets of the buffer/page and the reclaim mechanism honored that, everything should be fine. For FS that does not care about ordering, they could simply ignore this capability and the mechanism could assume that everything is in one big set and could be reclaimed in any order. I have not yet given the complexity of implementing such functionality any thought. But it seems to be feasible - since you would need to do that anyway for your FS
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Hi,

On Tue, 21 Dec 1999 14:57:29 +0100 (CET), Andrea Arcangeli <[EMAIL PROTECTED]> said:

> So you are talking about replacing this line:
>       dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
> with:
>       dirty = (size_buffers_type[BUF_DIRTY]+size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;

Basically yes, but I was envisaging something slightly different from the above. There may well be data which is simply not in the buffer cache at all but which needs to be accounted for as pinned memory. A good example would be if some filesystem wants to implement deferred allocation of disk blocks: the corresponding pages in the page cache obviously cannot be flushed to disk without generating extra filesystem activity for the allocation of disk blocks to pages. The pages must therefore be pinned, but as they don't yet have disk mappings we can't assume that they are in the buffer cache.

So we really need a pinned page threshold which can apply to general pages, not necessarily to the buffer cache.

There's another issue, though. BUF_DIRTY buffers do not necessarily count as pinned in this context: they can always be flushed to disk without generating any significant new memory allocation pressure. We still need to do write-throttling, so we need a threshold on dirty data for that reason. However, deferred allocation and transactions actually have a more subtle and nastier property: you cannot necessarily flush the pages from memory without first allocating more memory.

In the transaction case this is because you have to allow transactions which are already in progress to complete before you can commit the transaction (you cannot commit incomplete transactions because that would defeat the entire point of a transactional system!). In the case of deferred disk block allocation, the problem is that flushing the dirty data requires extra filesystem operations as we allocate disk blocks to pages.
In these cases we need to be able to make sure that not only does pinned memory never exceed a threshold, we also have to ensure that the *future* allocations required to flush the existing allocated memory can also be satisfied. We need to allow filesystems to "reserve" such extra memory, and we need a system-wide threshold on all such reservations. The ext3 journaling code already has support for reservations, but that's currently a per-filesystem parameter. We still have need for a global VM reservation to prevent memory starvation if multiple different filesystems have this behaviour. Note that what we need here isn't complex: it's no more than exporting atomic_t counts of the number of dirty and reserved pages in the system and supporting a maximum threshold on these values via /proc. The mechanism for observing these limits can be local to each filesystem: as long as there is an agreed counter in the VM where they can register their use of memory. --Stephen
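The shared accounting described above can be sketched with C11 atomics in userspace. This is only a model of the idea of an agreed counter with a maximum threshold: `model_dirty_pages`, `model_reserved_pages`, `model_charge()` and `model_uncharge()` are hypothetical names, and the limit is passed as a plain argument where the message suggests a /proc tunable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical system-wide counters that every filesystem would share. */
static atomic_long model_dirty_pages;
static atomic_long model_reserved_pages;

/* Try to add nr pages to a counter without letting it exceed limit.
 * Returns false, leaving the counter untouched, if the charge would
 * overshoot; the caller would then wait or flush before retrying. */
static bool model_charge(atomic_long *counter, long limit, long nr)
{
    long old = atomic_load(counter);
    for (;;) {
        if (old + nr > limit)
            return false;
        /* On failure, old is refreshed with the current value and the
         * limit check is repeated against it. */
        if (atomic_compare_exchange_weak(counter, &old, old + nr))
            return true;
    }
}

/* Give back nr pages once they have been cleaned or released. */
static void model_uncharge(atomic_long *counter, long nr)
{
    atomic_fetch_sub(counter, nr);
}
```

Each filesystem can keep its own policy for what to do when a charge is refused; only the counters and their limits need to be common.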
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:

>We cannot use the buffer.c dirty list anyway because bdflush can write
>those buffers to disk at any time. Transactions have to control the

So you are talking about replacing this line:

	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;

with:

	dirty = (size_buffers_type[BUF_DIRTY]+size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;

If you don't do that, you don't need _two_ filesystems to generate too many dirty buffers: you can potentially go OOM with only one journaling filesystem running.

As you talked about a _two_ filesystem case generating dirty buffers on 100% of memory, I thought you were talking about something very different from the above one-liner. If you were talking about it, that's fine and I agree of course.

>We're not talking about normal filesystems. :)

With "normal" filesystems I meant filesystems that are _using_ linux/fs/buffer.c.

Andrea
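The arithmetic in that one-liner can be tried out in isolation. The sketch below assumes the per-list sizes are byte counts (as the shift by PAGE_SHIFT suggests) and 4 KiB pages; the `MODEL_` names and the percentage check are invented for illustration and do not reproduce buffer.c.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12   /* assumes 4 KiB pages */

enum { MODEL_BUF_DIRTY, MODEL_BUF_PINNED, MODEL_NR_LISTS };

/* Pages that count against the dirty limit: ordinary dirty buffers plus
 * pinned (for example journaled) ones, converted from bytes to pages. */
static unsigned long model_dirty_pages(
    const unsigned long size_bytes[MODEL_NR_LISTS])
{
    return (size_bytes[MODEL_BUF_DIRTY] + size_bytes[MODEL_BUF_PINNED])
           >> MODEL_PAGE_SHIFT;
}

/* True when the dirty pages exceed the given percentage of all pages. */
static bool model_over_dirty_limit(unsigned long dirty_pages,
                                   unsigned long total_pages,
                                   unsigned long percent)
{
    return dirty_pages * 100 > total_pages * percent;
}
```

Leaving the pinned term out of the sum is what would let a single journaling filesystem pin memory without ever tripping the limit.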
Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
Hi, On Tue, 21 Dec 1999 11:18:03 +0100 (CET), Andrea Arcangeli <[EMAIL PROTECTED]> said: > On Tue, 21 Dec 1999, Stephen C. Tweedie wrote: >> refile_buffer() checks in buffer.c. Ideally there should be a >> system-wide upper bound on dirty data: if each different filesystem >> starts to throttle writes at 50% of physical memory then you only >> need two different filesystems to overcommit your memory badly. > If all FSes shares the dirty list of buffer.c that's not true. The entire point of this is that Linus has refused, point blank, to add the complexity of journaling to the buffer cache. The journaling _has_ to be done independently, so we _have_ to have the dirty data for journal transactions kept outside of the buffer cache. We cannot use the buffer.c dirty list anyway because bdflush can write those buffers to disk at any time. Transactions have to control the write ordering so we can only feed those writes into the buffer queues under strict control when we go to commit a transaction. > All normal filesystems are using the mark_buffer_dirty() in buffer.c We're not talking about normal filesystems. :) > so currently the 40% setting of bdflush is a system-wide number and > not a per-fs number. For filesystems that can use that mechanism, sure. We need to be able to extend that mechanism so that filesystems with other writeback mechanisms can use it too. > If both ext3 and reiserfs are using refile_buffer and both are using > balance_dirty in the right places as Linus wants, all seems just fine to > me. They aren't and they can't. > I completly agree to change mark_buffer_dirty() to call balance_dirty() > before returning. Agreed. --Stephen