Re: [patch 30/41] reiserfs convert to new aops.
On Thu, May 24, 2007 at 03:46:46AM +0400, Vladimir V. Saveliev wrote: Hello, Nick On Wednesday 16 May 2007 21:22, Vladimir V. Saveliev wrote: Two questions - first, is it possible to get rid of reiserfs_prepare_write / commit_write? I assume this is not a problem anymore as long as prepare_write and commit_write of address space operations are NULL now (with last version of reiserfs-convert-to-new-aops.patch). Yes, that should be fine and allow us to remove some of the generic code now. Second, can you make use of AOP_FLAG_CONT_EXPAND in order to get rid of the special generic_cont_expand routine for reiserfs? Sorry for delay with this. Did you mean something like this? Yes I did, thanks very much, we can now get rid of that weird generic_cont_expand too (assuming these patches will ever get merged). Thanks for doing this Vladimir! From: Vladimir Saveliev [EMAIL PROTECTED] This patch makes reiserfs to use AOP_FLAG_CONT_EXPAND in order to get rid of the special generic_cont_expand routine Signed-off-by: Vladimir Saveliev [EMAIL PROTECTED] diff -puN fs/reiserfs/inode.c~fs-reiserfs-use-generic_cont_expand_simple fs/reiserfs/inode.c --- linux-2.6.21-mm2/fs/reiserfs/inode.c~fs-reiserfs-use-generic_cont_expand_simple 2007-05-24 02:29:52.0 +0300 +++ linux-2.6.21-mm2-vs/fs/reiserfs/inode.c 2007-05-24 02:34:26.0 +0300 @@ -2561,13 +2561,20 @@ static int reiserfs_write_begin(struct f int ret; int old_ref = 0; + inode = mapping-host; + *fsdata = 0; + if (flags AOP_FLAG_CONT_EXPAND + (pos (inode-i_sb-s_blocksize - 1)) == 0) { + pos ++; + *fsdata = (void *)flags; + } + index = pos PAGE_CACHE_SHIFT; page = __grab_cache_page(mapping, index); if (!page) return -ENOMEM; *pagep = page; - inode = mapping-host; reiserfs_wait_on_write_block(inode-i_sb); fix_tail_page_for_writing(page); if (reiserfs_transaction_running(inode-i_sb)) { @@ -2677,6 +2684,8 @@ static int reiserfs_write_end(struct fil struct reiserfs_transaction_handle *th; unsigned start; + if ((unsigned)fsdata AOP_FLAG_CONT_EXPAND) + pos ++; reiserfs_wait_on_write_block(inode-i_sb); if (reiserfs_transaction_running(inode-i_sb)) @@ -3065,7 +3074,7 @@ int reiserfs_setattr(struct dentry *dent } /* fill in hole pointers in the expanding truncate case. */ if (attr-ia_size inode-i_size) { - error = generic_cont_expand(inode, attr-ia_size); + error = generic_cont_expand_simple(inode, attr-ia_size); if (REISERFS_I(inode)-i_prealloc_count 0) { int err; struct reiserfs_transaction_handle th; _
Re: Announce: new-aops-1 for 2.6.21-rc3
On Thu, Mar 15, 2007 at 04:47:13PM -0700, Mark Fasheh wrote: On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote: (excludes the OCFS2 patch that Mark sent, in anticipation of an update) Attached is said patch. I needed to export __grab_cache_page (ext2/ext3 also need this if they're to be built as modules), so a patch to do that is also attached. This passed some preliminary testing on a two node cluster I have here at Oracle. Thanks Mark, I've merged these. Nick
Announce: new-aops-1 for 2.6.21-rc3
OK, I've gone through and fixed several bugs until the thing actually survives fsx-linux for both ext2 and ext3 ordered and writeback (both when using the new aops, and the legacy prepare_write path). Actually ext3 sometimes breaks, but it does in unpatched kernels anyway. At 15 patches (including the initial buffered write deadlock fixes), it is too much to keep posting -- not much has fundamentally changed, so I'll just post occasionally if we make big changes. The quilt format is probably easier for someone wishing to work on it anyway. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ (excludes the OCFS2 patch that Mark sent, in anticipation of an update) It would be really nice if filesystem developers could take a look at the new interfaces some time, because otherwise they might get stuck with it :) So I'm cc'ing a few filesystems that come to mind, that I haven't heard anything from. Thanks, Nick
Re: Announce: new-aops-1 for 2.6.21-rc3
On Thu, Mar 15, 2007 at 12:32:45PM -0700, Joel Becker wrote: On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote: At 15 patches (including the initial buffered write deadlock fixes), it is too much to keep posting -- not much has fundamentally changed, so I'll just post occasionally if we make big changes. The quilt format is probably easier for someone wishing to work on it anyway. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/ For future drops, can you provide the unpacked patches too, so lazy people like me can read them in the browser? Thanks. Sorry, I did intend to unpack that, but forgot. It's done now, the new directory containing the patches is under the same URL as above. Thanks, Nick
Re: Announce: new-aops-1 for 2.6.21-rc3
On Thu, Mar 15, 2007 at 12:53:51PM -0700, Mark Fasheh wrote: On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote: OK, I've gone through and fixed several bugs until the thing actually survives fsx-linux for both ext2 and ext3 ordered and writeback (both when using the new aops, and the legacy prepare_write path). Actually ext3 sometimes breaks, but it does in unpatched kernels anyway. At 15 patches (including the initial buffered write deadlock fixes), it is too much to keep posting -- not much has fundamentally changed, so I'll just post occasionally if we make big changes. The quilt format is probably easier for someone wishing to work on it anyway. Hmm, we still left out some exports... Thanks, applied.
reiserfs set_page_dirty vs truncate race?
Hi, I wonder if resierfs needs to be careful about truncated pages, or does your .invalidatepage provide synchronisation against set_page_dirty somehow? -- Index: linux-2.6/fs/reiserfs/inode.c === --- linux-2.6.orig/fs/reiserfs/inode.c +++ linux-2.6/fs/reiserfs/inode.c @@ -2840,10 +2840,14 @@ static void reiserfs_invalidatepage(stru static int reiserfs_set_page_dirty(struct page *page) { - struct inode *inode = page-mapping-host; - if (reiserfs_file_data_log(inode)) { - SetPageChecked(page); - return __set_page_dirty_nobuffers(page); + struct address_space *mapping = page-mapping; + struct inode *inode; + if (mapping) { + inode = mapping-host; + if (reiserfs_file_data_log(inode)) { + SetPageChecked(page); + return __set_page_dirty_nobuffers(page); + } } return __set_page_dirty_buffers(page); }
Re: reiserfs set_page_dirty vs truncate race?
On Tue, Oct 10, 2006 at 06:29:27PM +0400, Vladimir V. Saveliev wrote: Hello On Tuesday 10 October 2006 11:44, Nick Piggin wrote: Hi, I wonder if resierfs needs to be careful about truncated pages, I think it is assumed that __set_page_dirty_nobuffers and __set_page_dirty_buffers care about truncated pages. So they may operate on truncated pages? Then you probably need the fix. Anyway, if you would like to merge it, you may add a Signed-off-by: Nick Piggin [EMAIL PROTECTED] Thanks, Nick
[patch] reiser4 releasepage fix
Hi Andrew, Can you put the following in -mm please? I cc'ed Hans about it earlier. Basically, resier4 does not present a good reason why pagecache should be dropped here. Basically the exact same operations are performed by releasepage callers (reclaim or truncate). If there is some atomicity problem, the code removed here wouldn't solve it anyway. And unless reiser4 performs some of its own special synchronisation, I can't see how it gets the pagecache removal sequence right for the reclaim case (ie. careful checks of page_count and PageDirty under tree_lock). With this patch in place, reiser4-reget-page-mapping.patch can go. I would also be worried about exporting remove_from_page_cache... at least the __ variant can stop being exported now. -- Index: linux-2.6/fs/reiser4/as_ops.c === --- linux-2.6.orig/fs/reiser4/as_ops.c +++ linux-2.6/fs/reiser4/as_ops.c @@ -358,11 +358,6 @@ int reiser4_releasepage(struct page *pag assert(reiser4-4, page-mapping != NULL); assert(reiser4-5, page-mapping-host != NULL); - /* is_page_cache_freeable() check - (mapping + private + page_cache_get() by shrink_cache()) */ - if (page_count(page) 3) - return 0; - if (PageDirty(page)) return 0; @@ -385,14 +380,6 @@ int reiser4_releasepage(struct page *pag /* we are under memory pressure so release jnode also. */ jput(node); - write_lock_irq(mapping-tree_lock); - /* shrink_list() + radix-tree */ - if (page_count(page) == 2) { - __remove_from_page_cache(page); - atomic_dec(page-_count); - } - write_unlock_irq(mapping-tree_lock); - return 1; } else { spin_unlock((node-load));
Re: [patch] reiser4 releasepage fix
On Sat, Apr 15, 2006 at 10:57:58AM +0200, Nick Piggin wrote: Hi Andrew, Can you put the following in -mm please? I cc'ed Hans about it earlier. I did test it under reclaim pressure, btw. Seems to be working OK.
Re: [patch] reiser4 releasepage fix
On Sat, Apr 15, 2006 at 06:17:33PM +0400, Alexander Zarochentsev wrote: hello, On Saturday 15 April 2006 12:57, Nick Piggin wrote: Hi Andrew, Can you put the following in -mm please? I cc'ed Hans about it earlier. Basically, resier4 does not present a good reason why pagecache should be dropped here. correct. it duplicates remove_mapping(). we tested similar patch recently and found no problems. Oh good :) -- Index: linux-2.6/fs/reiser4/as_ops.c === --- linux-2.6.orig/fs/reiser4/as_ops.c +++ linux-2.6/fs/reiser4/as_ops.c @@ -358,11 +358,6 @@ int reiser4_releasepage(struct page *pag assert(reiser4-4, page-mapping != NULL); assert(reiser4-5, page-mapping-host != NULL); Except my patch didn't remove the page ref count check from reiser4_releasepage. Nick, does the check conflict with the lockless pagecache patches? Reiser4 just tries to avoid unnecessary detaching private page objects if the page_count doesn't allow page removal. Well the check is racy anyway because if the page is in pagecache, then there is nothing to prevent find_get_page from running (eg. page fault, or read(2)). So strictly, no, the check does not conflict with lockless pagecache. It is simply nice to cut down the number of places where it isn't immediately obvious. I'd also like to think about removing page_count from outside mm/, in favour of a pagecache API which gives us more confidence that the caller is sane and has taken correct locks, etc. Anway, if you can ack this for Andrew, I'm sure he'd sleep a bit better (as would I)! - /* is_page_cache_freeable() check - (mapping + private + page_cache_get() by shrink_cache()) */ - if (page_count(page) 3) - return 0; - if (PageDirty(page)) return 0; BTW. the PageDirty check is also racy, so it can probably go too. I didn't have the courage to remove it because I don't know whether your page dirtying functions try to somehow synchronise against this. Thanks, Nick
elevators (was Re: I request inclusion of reiser4 in the mainline kernel)
CC list trimmed. On Tue, 2005-09-20 at 00:59 -0700, Hans Reiser wrote: Nick Piggin wrote: On Mon, 2005-09-19 at 23:28 -0700, Hans Reiser wrote: Well the terminology changed to io scheduler now, however the residual elevator name found in places doesn't cause anyone any problems and there isn't much reason to change it other than the desire to break things. Did you really say that?I mean, come on, can't you at least manage a well, it ought to get changed but I am busy with something more exciting to me. In a perfect world maybe it should be changed, however I don't know what out of tree drivers rely on the interface and I really don't care to find out. Simple as that. Ask Nate about this after he gets an ok from the customer to disclose his work. It is not so simple as you claim. Nate, I would be very interested to hear about your work if and when you are able to disclose it. [snip devfs] Yeah I was just trying to introduce some humour to the thread! Or maybe deflate one flamewar by starting another :) Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
Re: I request inclusion of reiser4 in the mainline kernel
Hans Reiser wrote: So why is the code in the kernel so hard to read then? Linux kernel code is getting better, and Andrew Morton's code is especially good, but for the most part it's unnecessarily hard to read. Look at the elevator code for instance. Ugh. What's wrong with the elevator code? The elevator code was one of the first things I got involved with as a complete kernel newbie, and I was able to follow the current code well enough to make a new IO scheduler, and extend the elevator API sufficiently to provide the fairly unique capabilities I needed. If it is the elevator *API* you are worried about, that is fairly trivial and well documented by Jens and myself in Documentation/block/biodoc.txt, along with an overview of some key ideas useful for iosched implementors. as-iosched.c itself is IMO reasonably well commented (at least the non-trivial, non-boilerplate functions). That is not to say it is trivial to understand because it is a fairly complex state machine and heuristics, but at less than 2000 lines of very well contained code it is not an impossible task to understand it. If that is too much for you, noop-iosched.c implements a fully working io scheduler in exactly 94 lines, including whitespace and comments. What are your specific concerns? I would be interested in helping to fix any valid ones you have. Thanks, Nick Send instant messages to your online friends http://au.messenger.yahoo.com
Re: reiser4 plugins
Markus Törnqvist wrote: I can't find the original post I'm thinking about but http://lkml.org/lkml/2005/5/16/68 says essentially the same thing. The scheduler is being improved for better behaviour on complex topologies like multi core + NUMA and multi level NUMA systems. If Con's work had gone in first, then conversely these improvements would have had to wait. There's also my all-time favorite, http://lkml.org/lkml/2005/3/14/4 What's wrong with that? The slowdown is due to the workload becoming disk bound. The reasons are still not entirely clear, but I don't think it is a recent (ie. 2.6) regression (or even a regression at all IIRC). The lack of QA seems appalling here, and I'm sure Reiser has had to do more of that for DARPA than most linux kernel hackers around. And what QA would you have preferred? I think if you are resorting to bringing up all time favourite blunders when trying to justify Reiser4 being included, then that is a sign right there that something is fundamentally wrong (if not with the code, then with your line of thought0 And note my email has nothing to do with any *real* argument for or against R4. Thanks, Nick Send instant messages to your online friends http://au.messenger.yahoo.com
Re: reiser4 plugins
Markus Törnqvist wrote: On Mon, Jun 27, 2005 at 07:46:15PM +1000, Nick Piggin wrote: The scheduler is being improved for better behaviour on complex topologies like multi core + NUMA and multi level NUMA systems. If Con's work had gone in first, then conversely these improvements would have had to wait. Or be merged in later. It _was_ merged later. Con was talking about Andrew's -mm tree. The problem is, why do the interfaces have to live so much that examples like this come to my ears (or eyes ;) all the time? The scheduler interfaces are probably among the most stable in the kernel. Con was talking about internal implementation. I hate to say this without digging out any URLs, but one friend of mine says he has a very hard time doing any networking code because it's too labile. Maybe that's being embettered for something else too? It seems there are plenty of people who can do networking code just fine. Or the other friend who curses that the networking code is just crap and basically has to rewrite the code to get it working. Yes, I've tried to get these guys to submit their code, but they argue back that no one wants to see it. I get the feeling your friend is making up some tall tales if he says he rewrote the networking code to be much better than it is now. But I can guarantee him that if he has it will be of great interest to the network developers. There's also my all-time favorite, http://lkml.org/lkml/2005/3/14/4 What's wrong with that? The slowdown is due to the workload becoming disk bound. The reasons are still not entirely clear, but I don't think it is a recent (ie. 2.6) regression (or even a regression at all IIRC). It's not that easy. So let's say the scheduler slowdown is a temporary situation to adapt it to multicore+NUMA. I assume that's what you mean, by having the kernels on par with 2.6.10 and the above paragraph. I'm pretty sure it isn't a scheduler slowdown, but rather the system is getting disk bound. I think the reporter has performance back to 2.6.10 levels - it is actually something I still have to follow up on, but it is difficult because the reporter isn't able to share his code, although he's otherwise very helpful. Why on earth did the slowdown ever get merged and we have to wait for it to be on par with some older version? If slowdowns do get merged, it is generally because they either haven't been reported until being merged or they make an accepted tradeoff. Not that they haven't been tested. Let's just get this straight, no amount of QA other than releasing a kernel and asking people to use it is going to find all the bugs that will be encountered. Please do correct me if I'm wrong :) I mostly agree with you apart from your specific examples I think we could do things better, and most of us have made some big blunders, However there is just no way that bringing any of that up will help Reiser4 being merged. Send instant messages to your online friends http://au.messenger.yahoo.com