Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Tue, 24 Oct 2000, Petr Vandrovec wrote: > On 23 Oct 00 at 23:05, Alexander Viro wrote: > > > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what > > analysis had been done? Petr, can you reproduce the problem on -test7? > > Unfortunately, clean test would take the backport of ext2 changes > > (truncate-related, happened around the same time), but IIRC -test7 was > > relatively stable... > > Just FYI - I was able to reproduce it with 2.4.0-test5 and all laters > (different pre-s of test6/7/8/9/10) I have. I was not able to reproduce > it with 2.2.17-pre13 (although this kernel had really bad behavior when > I run out of space with backing file - sendmail suspended itself due > to load of 37... on machine which has usually load 0.01). > > I do not have kernels between 2.2.17 and 2.4.0-test5 installed. > > If you have anything I should check... I see the race that can lead to screwups you've seen, but I don't see a really clear way to close it. Race in question: truncate_inode_pages() removes the page from pagecache. sync_pte is called before we get to killing the pagetable references. Potentially, the following might be more or less OK, but I'm not too happy with it: * All ->nopage() methods mark page uptodate (upon success, that is). * do_no_page() fails if page is not uptodate. * vmtruncate starts with setting the size and going through the pagecache, removing all fully truncated pages from the hash chains and gathering them in the list and removing uptodate flag. * then it does vmtruncate_list() calls, zapping the pagetable references to the pages in question. * *then* it goes through the list and does the rest of truncate_inode_pages() work. The theory being: once the i_size is set, no new pages will be bound to affected part of file. when we unhash all affected pages and mark them non-uptodate there will be no new pagetable references to them (they will not be found, and any pagefault in progress will fail since we mark them non-uptodate). any possible IO-in-progress on these pages will not be disturbed - it doesn't depend on the page being hashed. Aliases do not matter, since they won't be bound to the relevant part of file anyway. once we kill all pagetable references (and no new references are going to appear at that point) we can be sure that no process will initiate _any_ operations on these pages. Notice that potential dirty data on them doesn't matter - it's going to die anyway (file is being truncated). *now* we grab the page lock on each of these pages, thus letting any method-in-progress to be completed and that point we can safely kill ->mapping. Linus, could you comment on that variant? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 23:05, Alexander Viro wrote: > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what > analysis had been done? Petr, can you reproduce the problem on -test7? > Unfortunately, clean test would take the backport of ext2 changes > (truncate-related, happened around the same time), but IIRC -test7 was > relatively stable... Just FYI - I was able to reproduce it with 2.4.0-test5 and all laters (different pre-s of test6/7/8/9/10) I have. I was not able to reproduce it with 2.2.17-pre13 (although this kernel had really bad behavior when I run out of space with backing file - sendmail suspended itself due to load of 37... on machine which has usually load 0.01). I do not have kernels between 2.2.17 and 2.4.0-test5 installed. If you have anything I should check... Thanks, Petr Vandrovec [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > Note that if there really are only 9 "nopage" routines, then it is a lot > easier to just add the single "SetPageUptodate(page)" into those 9 > routines, and thus let the VM know of the race. Works for me. And yes, the list of ->nopage instances that can return success is that short: drm_vm_shm_nopage drm_vm_shm_nopage_lock drm_vm_dma_nopage sgi_graphics_nopage via_mm_nopage ncp_file_mmap_nopage shm_nopage shmzero_nopage filemap_nopage - sorry, it's not 9+filemap_nopage, it's 8+filemap_nopage. However, it still leaves a window for the race: we invalidate first and remove from pagetables later. And ClearPageUptodate() is obviously truncate_inode_pages() work. Grr... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Tue, 24 Oct 2000, Alexander Viro wrote: > > It's not the only problem, but I would feel _much_ safer if pagefault > wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the > insertion into page tables _within_ ->nopage()? NO! We used to do this a LOONG time ago. Distributing the VM information on how to properly insert a pte into the page table into drivers is _NOT_ something I want to do again. It was a pain to get it properly modularized, we're not going back to the horror. Note that if there really are only 9 "nopage" routines, then it is a lot easier to just add the single "SetPageUptodate(page)" into those 9 routines, and thus let the VM know of the race. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > On Mon, 23 Oct 2000, Alexander Viro wrote: > > > > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what > > analysis had been done? Petr, can you reproduce the problem on -test7? > > I don't think that is it - that code looks very straightforward (and is > needed on some silly architectures that cannot easily otherwise see if > they need to be coherent wrt user space - mainly sparc and virtual > caches). OK, I see where the race can happen. Yes, vmtruncate() tries to kill the mappings. Right. However, it does that _after_ truncate_inode_pages(). And there is a window when the only lock we are holding is ->i_sem. sync_pte in that window ==> we are fucked. So the question being: WTF do we postpone zapping the page tables until after the truncate_inode_pages()? The following rules might make life simpler, AFAICS: * as soon as ->i_size is set, no new pagetable references to off-limits pages can appear. * as soon as we are don with vmtruncate_list() there is no pagetable references. * truncate_inode_pages() never has to deal with pages refered from pagetables. * ->i_size can't increase until we return from vmtruncate(). It's not the only problem, but I would feel _much_ safer if pagefault wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the insertion into page tables _within_ ->nopage()? Look: let's take the tail of do_no_page() into helper function and just call it from the end of every bloody ->nopage() out there. It _is_ easy: we have only 9 instances in the tree not counting filemap_nopage(). Moreover, do_anonymous_page() will become symmetrical to the rest of the crowd. Comments? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what > analysis had been done? Petr, can you reproduce the problem on -test7? I don't think that is it - that code looks very straightforward (and is needed on some silly architectures that cannot easily otherwise see if they need to be coherent wrt user space - mainly sparc and virtual caches). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what analysis had been done? Petr, can you reproduce the problem on -test7? Unfortunately, clean test would take the backport of ext2 changes (truncate-related, happened around the same time), but IIRC -test7 was relatively stable... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > Also, the fact that Petr didn't see anything trigger in nopage() makes me > nervous again. Even if the problem happened during read-ahead, it should > have gotten into the address space only through nopage. Maybe there is > some vma that isn't added to the right inode VM list - so that we end up > missing part of the vmtruncate() stuff? > > Al, any ideas? Just one: let's slow down a bit and try to write down the rules. Close to release or not, let's understand WTF happens in that part of VM before deciding on the choice of band-aids/fixes. Frankly, right now I don't feel that area - too many changes during the last month. So I'm going to sit down and read it through. IMO it's worth the delay - I have a gut feeling that band-aids are going to cost more. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > > I'm starting to suspect that we leave this path as-is, and just fix the > mapping case (and PageUptodate() can work there). That should also avoid > the nasties. ..and even that looks like I'd have to do the quick-and-dirty case with the race still there on SMP. Adding the page Uptodate logic to the VM layer proper is too painful at this point - every single nopage function would have to be updated to mark its page up-to-date, as they don't generally do that currently. Also, the fact that Petr didn't see anything trigger in nopage() makes me nervous again. Even if the problem happened during read-ahead, it should have gotten into the address space only through nopage. Maybe there is some vma that isn't added to the right inode VM list - so that we end up missing part of the vmtruncate() stuff? Al, any ideas? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > With ClearPageDirty() kernel locked up (but no watchdog, so probably > some livelock) during bootup after fsck / Actually, it turns out that even with this issue fixed, there's the more serious issue that the page _has_ to be removed from the page cache once we get to the point that we're freeing the whole inode (which also calls truncate_inode_pages). Which means that we cannot take the easy way out and say "let's delay the freeing until everything is ok" - even if a buffer is busy due to some unfortunate timing (so that we turn the page into anonymous buffers), we'd better get rid of the page from the page cache. I'm starting to suspect that we leave this path as-is, and just fix the mapping case (and PageUptodate() can work there). That should also avoid the nasties. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > With ClearPageDirty() kernel locked up (but no watchdog, so probably > some livelock) during bootup after fsck /. Yeah, the way the truncate logic works right now truncate_whole_page() has to remove the page from the inode list - otherwise truncate ends up looping forever trying to truncate that page ;). And there was a off-by-one error in my first untested version anyway: the page_count() should be tested against "2", not "1", as the truncate logic has elevated the count anyway (probably unnecessarily: once we get the page lock nobody else can race to remove it from the page cache anyway, so it's not as if it could go away). > Should I try ClearPageUptodate() instead? No, I'll have to fix the truncate logic to allow for this all. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 14:34, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Alexander Viro wrote: > > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > > > > Nope, that just makes the race window smaller. We should check for i_size > > > after we've gotten the page table lock and just before actually entering > > > the page into the page tables. Otherwise we'll still race on SMP (a _very_ > > > hard window to get into, admittedly). > > > > Umm... I would probably remove Uptodate upon truncate() and check _that_ > > in the place you've mentioned. > > Works for me.. > > > > ClearPageDirty(page); > > ClearPageUptodate(page); > > > > How about that? > > Makes sense. With ClearPageDirty() kernel locked up (but no watchdog, so probably some livelock) during bootup after fsck /. Unfortunately, 'PC' column shows complete garbage - init,kswapd,kupdate and rcS in 'R', with 'PC'=current for rcS, 0xc1459f28 for init, 0xc146ffa8 for kswapd and 0xc1469fc8 for kupdate (needless to say that my kernel does not have 20MB... whee what's with PC? it now prints esp instead of eip?). Should I try ClearPageUptodate() instead? Petr Vandrovec [EMAIL PROTECTED] P.S.: I have to go home. Continuing after 10 hours. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > > Nope, that just makes the race window smaller. We should check for i_size > > after we've gotten the page table lock and just before actually entering > > the page into the page tables. Otherwise we'll still race on SMP (a _very_ > > hard window to get into, admittedly). > > Umm... I would probably remove Uptodate upon truncate() and check _that_ > in the place you've mentioned. Works for me.. > > ClearPageDirty(page); > ClearPageUptodate(page); > > How about that? Makes sense. Note that I'd actually like to hear from Petr first _without_ any added code in the nopage() handler - the issue of having a page mapped after nopage() that we shouldn't have mapped is a separate one, and I'd first like to hear if the problem really goes away even if the mapping bug is still there.. And _then_ we fix the fact that we should not allow anybody to have a page mapped past i_size. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). > This signals to me: should not vmtruncate_list acquire mm->mmap_sem, > if it modifies page tables? No. It should get the page_table lock, but that is sufficient for anybody who _clears_ page tables (and is pretty much the same case as paging something out of somebody elses page tables). >I cannot find anything what prevents doing > vmtruncate in one task and filemap_sync in another - neither > page_table_lock spinlock, nor mmap_sem semaphore... filemap_sync() does hold the page table lock. (Which is certainly not to say that it's necessarily bug-free, but I don't see any obvious problems off-hand). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Alexander Viro wrote: > > > > That's fine, but I'm afraid that we'll need a bit more than that. A couple of > > obvious ones: > > * filemap_nopage() needs the second check for ->i_size. Upon exit. > > Nope, that just makes the race window smaller. We should check for i_size > after we've gotten the page table lock and just before actually entering > the page into the page tables. Otherwise we'll still race on SMP (a _very_ > hard window to get into, admittedly). Umm... I would probably remove Uptodate upon truncate() and check _that_ in the place you've mentioned. > So how about this truncate_complete_page() implementation: > > /* >* Try to get rid of a page.. Clear it if it fails >* for some reason. The page must be locked upon calling >* this function. >* >* We remove the page from the page cache _after_ we have >* destroyed all buffer-cache references to it. Otherwise some >* other process might think this inode page is not in the >* page cache and creates a buffer-cache alias to it causing >* all sorts of fun problems ... >*/ > static inline void truncate_complete_page(struct page *page) > { > /* Try to get rid of buffers */ > if (page->buffers) > block_flushpage(page, 0); > > spin_lock(&pagecache_lock); > spin_lock(&pagemap_lru_lock); > > if (page_count(page) != 1) { > memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE); > } else { > ClearPageDirty(page); ClearPageUptodate(page); How about that? > __lru_cache_del(page); > __remove_inode_page(page); > page_cache_release(page); > } > spin_unlock(&pagemap_lru_lock); > spin_unlock(&pagecache_lock); > } > > we should probably special-case the "block_flushpage()" failed case, but > the above should do reasonable things with it (because page_count() will > be > 1 due to buffers). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > > That's fine, but I'm afraid that we'll need a bit more than that. A couple of > obvious ones: > * filemap_nopage() needs the second check for ->i_size. Upon exit. Nope, that just makes the race window smaller. We should check for i_size after we've gotten the page table lock and just before actually entering the page into the page tables. Otherwise we'll still race on SMP (a _very_ hard window to get into, admittedly). > Moreover, what is (area->vm_mm == current->mm) doing in the existing check? It's for ptrace. You can do ugly things with ptrace that aren't possible for the process itself. > * truncate() should zero the page out if it doesn't remove it from > cache. So how about this truncate_complete_page() implementation: /* * Try to get rid of a page.. Clear it if it fails * for some reason. The page must be locked upon calling * this function. * * We remove the page from the page cache _after_ we have * destroyed all buffer-cache references to it. Otherwise some * other process might think this inode page is not in the * page cache and creates a buffer-cache alias to it causing * all sorts of fun problems ... */ static inline void truncate_complete_page(struct page *page) { /* Try to get rid of buffers */ if (page->buffers) block_flushpage(page, 0); spin_lock(&pagecache_lock); spin_lock(&pagemap_lru_lock); if (page_count(page) != 1) { memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE); } else { ClearPageDirty(page); __lru_cache_del(page); __remove_inode_page(page); page_cache_release(page); } spin_unlock(&pagemap_lru_lock); spin_unlock(&pagecache_lock); } we should probably special-case the "block_flushpage()" failed case, but the above should do reasonable things with it (because page_count() will be > 1 due to buffers). The above is obviously completely and utterly untested. Petr? Willing to give this a go? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 13:57, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > > First page->mapping == NULL entry in syslog is dated 22:23:58, but > > couple of entries was lost before (probably I should print only '.' for > > each such page; this run there was more than 100 such pages) > > Another question is why SIGCHLD was delivered to parent AFTER ftruncate, > > but exit was invoked couple of seconds before - maybe it syncs > > child address space to disk? > > exit() basically does do a msync(MSASYNC), so that could be it. Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). This signals to me: should not vmtruncate_list acquire mm->mmap_sem, if it modifies page tables? I cannot find anything what prevents doing vmtruncate in one task and filemap_sync in another - neither page_table_lock spinlock, nor mmap_sem semaphore... Thanks, Petr Vandrovec [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Alexander Viro wrote: > > > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > > > Al, any ideas? I have this feeling that the simplest fix is just to leave > > > the race open, and make truncate_complete_page() just leave such a "racy" > > > page in the page cache. It will still race, and the invalid page will > > > still exist, but the end result should be harmless. > > > > Provided that we clean it - why the hell do we want to take it out of > > the pagecache? I don't see any fundamental reasons to prohibit pages > > past the ->i_size being hashed. > > In fact, we used to allow them. > > We do want to remove it from the page cache under normal circumstances: it > makes for much better MM behaviour (ie free pages that are truly useless). > But I suspect that just adding a test for "page_count(page) == 1" (ie same > as for the page cache invalidation) before freeing it should give that > advantage, along with avoiding the problematic unlikely case... That's fine, but I'm afraid that we'll need a bit more than that. A couple of obvious ones: * filemap_nopage() needs the second check for ->i_size. Upon exit. Moreover, what is (area->vm_mm == current->mm) doing in the existing check? * truncate() should zero the page out if it doesn't remove it from cache. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > Yes. Bad news. No problem was catched in filemap_nopage, but one > (of 57000) pages was dirty and had page->mapping == NULL... (maybe > only one was caused that this was just after bootup, with plenty of memory) > Maybe I should look at readahead code? Yes, you're right, read-ahead is in fact even more likely to catch the race. > In case of truncate it is going irrevocable away. Accesses after truncate > should (and sometime give you) SIGBUS... Yes. But I'd rather have a small race where we under certain strange circumstances forget to raise a SIGBUS than have a kernel bug that causes oops and possible filesystem corruption. So I'm not just looking for a fix, I'm also looking for a SIMPLE fix, with perhaps some major surgery later (things like making the inode semaphore a read-write semaphore and just always synchronizing 100% on i_size - which would fix the problem, but is not a 2.4.x thing, no way Jose). > First page->mapping == NULL entry in syslog is dated 22:23:58, but > couple of entries was lost before (probably I should print only '.' for > each such page; this run there was more than 100 such pages) > Another question is why SIGCHLD was delivered to parent AFTER ftruncate, > but exit was invoked couple of seconds before - maybe it syncs > child address space to disk? exit() basically does do a msync(MSASYNC), so that could be it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > Al, any ideas? I have this feeling that the simplest fix is just to leave > > the race open, and make truncate_complete_page() just leave such a "racy" > > page in the page cache. It will still race, and the invalid page will > > still exist, but the end result should be harmless. > > Provided that we clean it - why the hell do we want to take it out of > the pagecache? I don't see any fundamental reasons to prohibit pages > past the ->i_size being hashed. In fact, we used to allow them. We do want to remove it from the page cache under normal circumstances: it makes for much better MM behaviour (ie free pages that are truly useless). But I suspect that just adding a test for "page_count(page) == 1" (ie same as for the page cache invalidation) before freeing it should give that advantage, along with avoiding the problematic unlikely case... Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 16:13, Alexander Viro wrote: > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > Al, any ideas? I have this feeling that the simplest fix is just to leave > > the race open, and make truncate_complete_page() just leave such a "racy" > > page in the page cache. It will still race, and the invalid page will > > still exist, but the end result should be harmless. > > Provided that we clean it - why the hell do we want to take it out of > the pagecache? I don't see any fundamental reasons to prohibit pages > past the ->i_size being hashed. Methods _must_ check for ->i_size, > but they do it anyway. All race-prevention is based on page locks and > ->i_sem. > > Yes, filemap_nopage() should check for i_size at the very end and fail if > the page became off-limits. But that's completely unrelated issue - it's > mmap semantics, not pagecache one. Yes. Bad news. No problem was catched in filemap_nopage, but one (of 57000) pages was dirty and had page->mapping == NULL... (maybe only one was caused that this was just after bootup, with plenty of memory) Maybe I should look at readahead code? Although to be clear I do not know why. Unless there is bug in logic in test program, it should first dirty pages, and AFTER that it should truncate - and unmap and exit, without ever touching pages of mapping... My first testcases were with this race (and with raw devices), but then I found (by removing more and more code) that no race (and no raw devices) are required... > The point being: we should _never_ drop ->mapping unless the page is > irrevocably going away. We can (and probably should) drop the off-limits > page as soon as ->count hits zero, but we should not do it before that. In case of truncate it is going irrevocable away. Accesses after truncate should (and sometime give you) SIGBUS... total used free sharedbuffers cached Mem:255768 42208 213560 0496 18420 -/+ buffers/cache: 23292 232476 Swap: 530136 13200 516936 Strace of another run: 1688 22:23:41.748438 execve("./oopsdemo", ["./oopsdemo"], [/* 18 vars */]) = 0 1688 22:23:41.749058 brk(0)= 0x8049ae8 1688 22:23:41.749399 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000 1688 22:23:41.749641 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) 1688 22:23:41.749861 open("/etc/ld.so.cache", O_RDONLY) = 4 1688 22:23:41.750011 fstat(4, {st_mode=S_IFREG|0644, st_size=43818, ...}) = 0 1688 22:23:41.750253 old_mmap(NULL, 43818, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40017000 1688 22:23:41.750408 close(4) = 0 1688 22:23:41.750538 open("/lib/libc.so.6", O_RDONLY) = 4 1688 22:23:41.750676 fstat(4, {st_mode=S_IFREG|0755, st_size=1057576, ...}) = 0 1688 22:23:41.750878 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\224\314"..., 4096) = 4096 1688 22:23:41.751173 old_mmap(NULL, 1072484, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x40022000 1688 22:23:41.751327 mprotect(0x4011e000, 40292, PROT_NONE) = 0 1688 22:23:41.751441 old_mmap(0x4011e000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xfb000) = 0x4011e000 1688 22:23:41.751633 old_mmap(0x40124000, 15716, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40124000 1688 22:23:41.751809 close(4) = 0 1688 22:23:41.753291 munmap(0x40017000, 43818) = 0 1688 22:23:41.753554 getpid() = 1688 1688 22:23:41.753785 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(4, 1), ...}) = 0 1688 22:23:41.753993 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000 1688 22:23:41.754162 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0 1688 22:23:41.754430 write(1, "Go\n", 3) = 3 1688 22:23:41.754727 open("ram0", O_RDWR|O_CREAT, 0600) = 4 1688 22:23:41.754949 unlink("ram0")= 0 1688 22:23:41.755103 ftruncate(4, 234881024) = 0 1688 22:23:41.755228 old_mmap(NULL, 234881024, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x40128000 1688 22:23:41.755700 pipe([5, 6]) = 0 1688 22:23:41.755847 pipe([7, 8]) = 0 1688 22:23:41.756024 fork()= 1689 1689 22:23:41.756735 close(6 1688 22:23:41.756798 close(5 1689 22:23:41.756844 <... close resumed> ) = 0 1688 22:23:41.756894 <... close resumed> ) = 0 1689 22:23:41.756949 close(7 1688 22:23:41.756997 close(8 1689 22:23:41.757041 <... close resumed> ) = 0 1688 22:23:41.757089 <... close resumed> ) = 0 1689 22:23:41.757139 close(4 1688 22:23:41.757188 write(6, "\0", 1 1689 22:23:41.757250 <... close resumed> ) = 0 1688 22:23:41.757301 <... write resumed> ) = 1 1689 22:23:41.757355 read(5, 1688 22:23:41.757405 read(7, 1689 22:23:41.757450 <... read resumed> "\0", 1) = 1 1689 22:23:49.260756 write(8, "\0", 1) = 1 1689 22:23:49.436204 read(5, 1688 22:23:49.442969 <... read resumed> "\
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > Al, any ideas? I have this feeling that the simplest fix is just to leave > the race open, and make truncate_complete_page() just leave such a "racy" > page in the page cache. It will still race, and the invalid page will > still exist, but the end result should be harmless. Provided that we clean it - why the hell do we want to take it out of the pagecache? I don't see any fundamental reasons to prohibit pages past the ->i_size being hashed. Methods _must_ check for ->i_size, but they do it anyway. All race-prevention is based on page locks and ->i_sem. Yes, filemap_nopage() should check for i_size at the very end and fail if the page became off-limits. But that's completely unrelated issue - it's mmap semantics, not pagecache one. The point being: we should _never_ drop ->mapping unless the page is irrevocably going away. We can (and probably should) drop the off-limits page as soon as ->count hits zero, but we should not do it before that. Comments? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > I'll take a better look at the truncate case (I consider the invalidate > > case closed). Do you have a simple test-program around? > > Well, I cannot say simple. As I was not able to reproduce it with only > one task, code below: Ok, without running this I can already guess at what's up. One process does the "truncate()", and races with another process that does a page-in. The truncate code does notify_change(ATTR_SIZE); which will basically cause a "vmtruncate(inode, newsize)". Just before this happened, the other process gets a page fault, and does a page_cache_read(). Now, because we are low on memory (this is why it only shows up when you're swapping), that read will take a while. During that time, the vmtruncate() starts to execute, and sets inode->i_size to the new value (that would cause us to no longer accept a page fault - but we already got past that check in the faulting process). It will then invalidate all the inode pages > i_size. Now, the page faulting process comes back, and puts the page into the VM space. Never mind that it has in the meantime gotten i_mapping = NULL due to the other process doing the truncate. Now, if I'm right, you should be able to add something like if (!old_page->mapping) printk("mapping went away from under us\n"); to just before the "return old_page()" case in the success path of filemap_nopage() (mm/filemap.c), and you should see that printk() trigger when the bug happens. (There are other users too that are not synchronized wrt inode size changes and could get an access to a page past the end of the file this way - I just think that filemap_nopage is probably the one where this is most easily seen). The above is obviously not a bug-fix, it's just a validation of the theory. Al, any ideas? I have this feeling that the simplest fix is just to leave the race open, and make truncate_complete_page() just leave such a "racy" page in the page cache. It will still race, and the invalid page will still exist, but the end result should be harmless. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 12:09, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > > > Hi Linus, > > unfortunately, this does not explain problem I reported. In my case, > > underlying fs is ext2, and there is no locking around - only one task > > does ftruncate() on (big) shareable mapped file (original code does it to > > prevent dirty pages to be written to disk after their contents is no > > longer relevant). > > Yes. Yours is a separate problem, and is not due to invalidate_pages() at > all, but apparently due to truncate_complete_page(). > > Basically, all the same old arguments do apply - we cannot just remove the > page from the page cache if it is mapped. > > Now, the truncate() case is different, because the code _tries_ to also > zap the actual mappings. The fact that it fails to do so is a bit > unnerving, actually. > > I'll take a better look at the truncate case (I consider the invalidate > case closed). Do you have a simple test-program around? Well, I cannot say simple. As I was not able to reproduce it with only one task, code below: (1) mlocks itself in memory (you can remove it) (2) creates 224MB file and maps it shareable into memory (224 is for 256MB machine; I was not able to reproduce it with less than 100MB file on 256MB machine) (and you must have 224MB of free disk space in current directory, otherwise...) and unlinks it (3) fork (4) child process dirties memory (5) parent signals child to exit and wait 5 secs (6) child exits (hopefully...) (7) parent unmaps data (8) parent exits You'll always get page->mapping == NULL dereference... And btw, if I start code below on my machine (dual PIII/450), after 2 secs of run whole machine stops - only ping and console switching works - and stays that way for more than 20 secs. After that program finishes (killing kernel if no antioops in filemap_write_page present...) and everything returns to normal (if "-/+ buffers/cache: 28848 226920" is normal ;-) ). You can look at "http://platan.vc.cvut.cz/listit" for dump of "struct page" at time when page->mapping == NULL was catched in filemap_write_page, or through linux-kernel@ archive searching for 'page->mapping == NULL' during last month, either from me or from Quintella. Thanks, Petr Vandrovec [EMAIL PROTECTED] #include #include #include #include #include #include #include #include #include #include unsigned char zero; #define MSIZE 0x0E00 void x4768(void) { int fd; pid_t pid; int x4778_1[2]; int x4778_2[2]; int from4778; int to4778; char* mma[65536]; unsigned int mml[65536]; unsigned int mmi = 0; unsigned int ln = 0; unsigned int x; #define MAL2(a,l) ftruncate(fd, ln+l); mml[mmi] = l; mma[mmi++] = mmap(a, l, PROT_READ|PROT_WRITE, MAP_SHARED, fd, ln); ln += l; mma[mmi-1][0] = 99; #define MAL(l) MAL2(NULL,l) fd = open("ram0", O_RDWR | O_CREAT, 0600); unlink("ram0"); MAL(MSIZE); pipe(x4778_1); pipe(x4778_2); pid = fork(); if (!pid) { int from4768 = x4778_1[0]; int to4768 = x4778_2[1]; close(x4778_1[1]); close(x4778_2[0]); close(fd); read(from4768, &zero, 1); for (x = 0; x < mml[mmi - 1]; x += 4096) mma[mmi-1][x] = 98; write(to4768, &zero, 1); read(from4768, &zero, 1); exit(0); } else if (pid < 0) { perror("fork failed"); exit(222); } to4778 = x4778_1[1]; from4778 = x4778_2[0]; close(x4778_1[0]); close(x4778_2[1]); { write(to4778, &zero, 1); read(from4778, &zero, 1); write(to4778, &zero, 1); sleep(5); ftruncate(fd, 0); } while (mmi--) { munmap(mma[mmi], mml[mmi]); } exit(0); } int main(int argc, char* argv[]) { printf("Go\n"); mlockall(MCL_CURRENT); x4768(); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > Hi Linus, > unfortunately, this does not explain problem I reported. In my case, > underlying fs is ext2, and there is no locking around - only one task > does ftruncate() on (big) shareable mapped file (original code does it to > prevent dirty pages to be written to disk after their contents is no > longer relevant). Yes. Yours is a separate problem, and is not due to invalidate_pages() at all, but apparently due to truncate_complete_page(). Basically, all the same old arguments do apply - we cannot just remove the page from the page cache if it is mapped. Now, the truncate() case is different, because the code _tries_ to also zap the actual mappings. The fact that it fails to do so is a bit unnerving, actually. I'll take a better look at the truncate case (I consider the invalidate case closed). Do you have a simple test-program around? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 10:33, Linus Torvalds wrote: > Trond Myklebust <[EMAIL PROTECTED]> wrote: > > > >As for simply settling for a self-consistent mmap() rather than > >tackling the problem of rereading; the main crime is that you're > >rendering file locking unusable. ... > This is not a crime. > ... > And yes, I know that file locking can try to be nice in both of these > cases. I know that there are systems that try to revert shared mappings > etc upon file locking (or not allow file locking at all if mappings are > present). Linux is not one of them, and never has been. Neither are > most UNIXes out there. > Linus Hi Linus, unfortunately, this does not explain problem I reported. In my case, underlying fs is ext2, and there is no locking around - only one task does ftruncate() on (big) shareable mapped file (original code does it to prevent dirty pages to be written to disk after their contents is no longer relevant). And for some reason, not all instance of shared mapping are invalidated - some are still connected to underyling pages, but these pages have page->mapping = NULL... (and no, in testcase there is no vmware around). So on task exit kernel dies with page->mapping dereference in filemap_write_page (->filemap_sync->filemap_unmap->exit_mmap-> ->mmput->do_exit)... Thanks, Petr Vandrovec [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
In article <[EMAIL PROTECTED]>, Trond Myklebust <[EMAIL PROTECTED]> wrote: > >As for simply settling for a self-consistent mmap() rather than >tackling the problem of rereading; the main crime is that you're >rendering file locking unusable. This is not a crime. Anybody who uses file locking over NFS is buggy and nobody sane should expect it to work reliably. There are good reasons why mail agents etc depend on other kinds of locking over NFS. Furthermore, anybody who expects file locking to work _anywhere_ (ie including local filesystems) in the presense of file mapping is just so out to lunch that it's ridiculous. Neither of these issues have anything to do with Linux. They are just statements of fact. And yes, I know that file locking can try to be nice in both of these cases. I know that there are systems that try to revert shared mappings etc upon file locking (or not allow file locking at all if mappings are present). Linux is not one of them, and never has been. Neither are most UNIXes out there. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Thu, 19 Oct 2000, Linus Torvalds wrote: > You'd have to do something like > > LockPage(page); /* Nobody gets to write to this page (except >through mmaps, ugh) */ > gather_all_mmap_users(page);/* THIS is the nasty one */ Wait a second. invalidate_inode_pages() has no idea of range, right? Finding all VMAs of shared mappings for given inode is not a big deal. Sure, repeating it for each bloody page would be painful at extreme, but "make sure that every access to address within _that_ VMA will result in pagefault" looks like a reasonable operation. Basically, we have two sides - pagecache and many VMAs. And loop over VMAs with per-VMA operations sounds more reasonable than loop over pages. We _can_ block pageins without messing with pages themselves (it starts with finding VMA, after all) and we can block new shared mappings (not a big deal). Comments? Basically, I propose per-VMA rw-semaphore taken on page-in for read and on that "flush and make sure we'll reread" operation for write + rw-sem on i_mmap_shared. We could even make invalidation work for less than whole file, all we need for that is skipping the VMAs out of the range. Ingo, Linus? > nfs_wb_page(page); /* force write-back on this page */ > ClearPageUptodate(page);/* mark it not up-to-date to force a read-in >next time */ > UnlockPage(page); /* Ok, now the client can go wild */ > > where everything but the "gather_all_mmap_users()" part is fairly > straightforward. The "gather" phase is nasty - it would need to figure > out every place the page is mapped, make sure those are synchronized (ie > something like marking the page table entry write-protected and causing a > TLB invalidate SMP cross-call - at which point the resulting page fault > and the page lock will catch anybody who tries to write to the page).. > In no case could you do something like what the current > invalidate_inode_pages() does, which is to just try to drop the page from > the cache - that really only works if we're the only user of that page, > which the "page_count != 1" test now enforces. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Alexander Viro <[EMAIL PROTECTED]> writes: > Again, consider the case when two processes share the > mapping. Process A has page faulted in. Page is > invalidated. Process B tries to access the same page. If you > leave it in page tables of A you _MUST_ leave it in > cache. Period. Otherwise A and B will have different instances > of the page. Even so, you want to reread it. When I said automatically msync(), I meant 'schedule a write of what has changed and then do whatever you need to do to get the page reread'. (You explicitly asked for semantics, not an implementation) > And rereading the thing might be tolerable _only_ if there is > another client that had changed the file. Even if you msync() > everything, you have to deal with plain and boring memory > modifications done by a process that did that bloody mmap(). If > they happen while you are reading the data from server - too > fscking bad, you'ld better have a good excuse for destroying > the data. write() from another client _is_ a good excuse. But > from my reading of fs/nfs/* it looks like we do that (cache > invalidation) left, right and center in cases that have nothing > to another clients. That's because under NFS you don't have a cache consistency protocol. Nothing tells you that the file/directory has changed and that you have to resync your cache. Instead, you have to infer it from the fact that some operation has returned file attributes that are screwy. In addition you may want to force a reread, because some operation just changed a directory on the server, and you don't know what else changed beforehand. > IOW, I think that invalidate_inode_pages() is bogus. There is > only one situation when we have a right to remove page from > pagecache - when it is not mapped anywhere. The issue is not about removing pages. It's about forcing a reread of the cached data from the server. Removing the actual pages from the cache has so far been the only race-free method for doing this (since pre-2.2.x at least) while ensuring that at least generic 'read', 'readdir' and 'write' work as expected. Yes it screws up mmap() and should be fixed but without breaking what little that works please. As for simply settling for a self-consistent mmap() rather than tackling the problem of rereading; the main crime is that you're rendering file locking unusable. Locking is the case in which you have to issue a guarantee that the cache is consistent between client and server within the area covered by the lock. In all other cases you *could* get away with the partial cache invalidation implementation by arguing that there no consistency guarantees inherent in the protocol. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Fri, 20 Oct 2000, Trond Myklebust wrote: > For the general case of the page cache I think we can keep them quite > simple: > > + We do in any case want to drop all pages that are unreferenced. (The > reason for flushing may be that the file size has changed.) > > + For pages that are referenced (and unlocked) we would like to force > them to get read in anew ASAP. How this is done in practice is > irrelevant as far as NFS is concerned provided that we don't sleep on > any I/O while in nfs_zap_caches()/invalidate_inode_pages(). > > The lower level stuff can and will sort out the business of flushing > out pending writebacks that conflict with the read, so that isn't a > problem for the VFS/VM. > > The problem lies with writes that haven't yet been msync()ed (and > hence do not have writebacks). For shared mappings, one should perhaps > schedule an automatic msync() of the dirty pages (???). For private > mappings, perhaps the best thing would be to defer the read? Again, consider the case when two processes share the mapping. Process A has page faulted in. Page is invalidated. Process B tries to access the same page. If you leave it in page tables of A you _MUST_ leave it in cache. Period. Otherwise A and B will have different instances of the page. It's not about writebacks. If you map something with MAP_SHARED and fork() afterwards you _MUST_ have the same data at the address returned by mmap() until one of the processes unmaps the thing. And rereading the thing might be tolerable _only_ if there is another client that had changed the file. Even if you msync() everything, you have to deal with plain and boring memory modifications done by a process that did that bloody mmap(). If they happen while you are reading the data from server - too fscking bad, you'ld better have a good excuse for destroying the data. write() from another client _is_ a good excuse. But from my reading of fs/nfs/* it looks like we do that (cache invalidation) left, right and center in cases that have nothing to another clients. IOW, I think that invalidate_inode_pages() is bogus. There is only one situation when we have a right to remove page from pagecache - when it is not mapped anywhere. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 20 Oct 2000, Trond Myklebust wrote: > > " " == Russell King <[EMAIL PROTECTED]> writes: > > > invalidate_inode_pages nfs_zap_caches nfs_lock fcntl_setlk > > do_fcntl sys_fcntl > > > So I guess that NFS locking is really bad if the region is > > mmapped! > > Yep, but that's a symptom, not a cause. We want to be able to run > invalidate_inode_pages() safely at any moment, since the need can be > triggered externally (because the server and client page caches > disagree). So what exactly do you want it to do when page is mapped by user process? Should it remain visible or not? What should happen if process writes to that page? Trond, I'm not asking about implementation - the question being what semantics do you want for nfs_zap_caches() wrt user-mapped pages. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 20 Oct 2000, Trond Myklebust wrote: > Under NFS the problem is that pages can (and *should*) be invalidated > despite there being pending write backs. The server can trigger the > need for a cache invalidation at any time. OK, so what should happen if user does mmap() on NFS file, dirties the page and server tells that page should be invalidated? Should we still see that page in process' address space? Silently making it anonymous is the worst thing possible - it's still accessible through process page tables and you (a) still see the (allegedly) invalidated data and (b) have no way in hell to get the cache coherency between processes. I.e. you are in for situations when we do mmap(), fork(), write data in child and attempt to read from that address in parent and child yield different results. Moreover, future attempts to modify the data by either of them will be invisible in another. IOW, what do you really want from the invalidation? It looks like the right thing _might_ be "remove from all page tables", but that can become really interesting with mlock() and friends. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Fri, 20 Oct 2000, Roger Larsson wrote: > Is it legal/good practice to unmap the file after closing it? Yes. > (Since the sharing needs the fd to mmap it) It doesn't. Mapping needs struct file * and it doesn't care about fd. mmap() takes a reference to struct file by fd you've passed and after that we can forget about descriptors - vma_struct holds a reference to file and that's it. > Successful unlinking a file should probably free pages directly to > free list - might be worth optimizing for. _Definitely_ no. Hell, unlink() doesn't mean that anything happens with data - unlink an opened file and if that affects the read/write/lseek you've found a bug. And mapping is equivalent to having opened descriptors in that respect. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Fri, 20 Oct 2000, Trond Myklebust wrote: > > " " == David S Miller <[EMAIL PROTECTED]> writes: > > > Actually, judging by the trace you provided Russell, I'd say > > this is some peculiarity with NFS silly rename handling, and > > it'd be best to look for the bug in that code (early inode > > reference loss, for example?) > > Russel's trace indicates that the unlink actually has completed and > has become a negative dentry since the file is labelled '(deleted)'. No, it doesn't. Proof: exec 3>/tmp/foo rm /tmp/foo ls -l /proc/$$/fd/3 and dentry is _not_ negative. > That means that the dentry count must have been zero, so that > dentry_iput() was called. It doesn't and it wasn't. > I don't see how dentry_iput() can be called on an open file. In > principle the dentry count should always be >= 1, so unless there is > some place where we're calling d_delete() without get()ing the dentry > first, there should be no path for early inode loss. d_delete() will not make dentry negative if there are other users. However, it will make it unhashed in such case, ergo (deleted) thing. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
[ Final comment, and then I'll shut up ] On Thu, 19 Oct 2000, Linus Torvalds wrote: > > You'd have to do something like > > LockPage(page); /* Nobody gets to write to this page (except >through mmaps, ugh) */ > gather_all_mmap_users(page);/* THIS is the nasty one */ > nfs_wb_page(page); /* force write-back on this page */ > ClearPageUptodate(page);/* mark it not up-to-date to force a read-in >next time */ > UnlockPage(page); /* Ok, now the client can go wild */ Note that one approach would be to just make invalidate_inode_pages() do exactly the above. Now, the gather_all_mmap_users() part is definitely a post-2.4.x thing, and we can't do nfs_wb_page() in invalidate_inode_page() either, but what we CAN do is to clear the Uptodate flag (and we do need the page lock to do so). The advantage of clearing the uptodate flag (as opposed to doing what we do now - dropping the page altogether) is that there would be no cache aliasing issues, and there would be no issues with a page and its associated data just "disappearing" from under somebody. It would cause the page to be read in again the next time it is faulted in or somebody does a read() on it, and that's exactly what we want. However, we _do_ need to WB the page data some way - but the decision on whether to invalidate write-backs or finish them would have to be up to the low-level filesystem. (We should _not_ allow people to read in the page, and leave some stale write-back information there, so that we'd write back stuff that we just read because somebody else noticed that the page was not up-to-date. That would be just an endless source of (a) confusion and (b) unnecessary network traffic.) We could, of course, have some combination of the two: if there is only one user (us), we can just drop the page, otherwise we can mark it non-up-to-date to force future readers to re-validate the dang thing. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Thu, 19 Oct 2000, Linus Torvalds wrote: > > I'm saying that we're much much better off guaranteeing local consistency > over knowingly breaking local consistency over a uncertain global > consistency issue. Especially as NFS has never guaranteed global > consistency in the first place, and was not designed to do that. Btw, if you really worry about NFS v3 and want to add consistency guarantees to NFS v3, you can probably do that. It's just that you cannot do it with a simple cache invalidation - you'd need to be much much more careful (the same way a simple CPU cache invalidate does not guarantee SMP cache consistency - it only guarantees data corruption due to missed writes). I don't know what the best way to do a true cache consistency protocol on a page leve would be, especially as you will have a _really_ hard time getting hold of an exclusive pointer to a page with multiple mmap's, but you could probably get something that guarantees cache consistency as long as people do _not_ expect mmap to always be 100% consistent. (The reason you cannot get mmap consistency is just that mmap doesn't have any kernel synchronization point unless you're willing to shoot down every single mapping - which is expensive as hell, but doable). You'd have to do something like LockPage(page); /* Nobody gets to write to this page (except through mmaps, ugh) */ gather_all_mmap_users(page);/* THIS is the nasty one */ nfs_wb_page(page); /* force write-back on this page */ ClearPageUptodate(page);/* mark it not up-to-date to force a read-in next time */ UnlockPage(page); /* Ok, now the client can go wild */ where everything but the "gather_all_mmap_users()" part is fairly straightforward. The "gather" phase is nasty - it would need to figure out every place the page is mapped, make sure those are synchronized (ie something like marking the page table entry write-protected and causing a TLB invalidate SMP cross-call - at which point the resulting page fault and the page lock will catch anybody who tries to write to the page).. In no case could you do something like what the current invalidate_inode_pages() does, which is to just try to drop the page from the cache - that really only works if we're the only user of that page, which the "page_count != 1" test now enforces. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Linus Torvalds <[EMAIL PROTECTED]> writes: > which is really really bad, because now you have the case that > you have 'n' copies of the same page in memory, with 'n' users, > out of which 'n-1' users have the wrong page. And those 'n-1' > users don't even have any way of _knowing_ that they have the > wrong page. > Which is why we MUST NOT drop a page that has users. Really. > I'm telling you that cases #4 and #5 are _much_ worse than your > "solution" to case #2. And you argue that your solution is good > only because you're completely ignoring cases #4 and #5. No. I'm arguing (at 4:40am and while trying to keep one eye on our detector's data acquisition) on the basis that whoever holds the file lock has to have a guarantee of obtaining 100% accuracy on the locked region. I agree that dropping pages is ugly and that it will always give problems with shared mmap(), so if it can be shown that clearing PG_uptodate and rereading the same page will give the required guarantee on locking, then I'm not going to complain. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 20 Oct 2000, Trond Myklebust wrote: > > The problem here is that NFS pages have 3 rather than 2 states: > 1) mmapped & correct. > 2) mmapped & incorrect. (but possibly dirty) > 3) Unmapped > > For case 1), we clearly want to have the page in inode->i_mapping. > For cases 2) & 3) we don't. I think you're WRONG, WRONG, WRONG. Your case #2 is not at all a state. It's "mapped, possibly dirty, and locally correct. But _maybe_ globally incorrect". If you remove the page in case #2 from the hashing, you will have case #4, which you're ignoring: 4) Totally, and utterly incorrect. Without any way for the application to even _know_ that it's incorrect. Note that case 4 is accompanied by case #5, later on: 5) two separate pages both mapped, one hashed and up-to-date, the other mapped in other processes but incorrect. which is really really bad, because now you have the case that you have 'n' copies of the same page in memory, with 'n' users, out of which 'n-1' users have the wrong page. And those 'n-1' users don't even have any way of _knowing_ that they have the wrong page. Which is why we MUST NOT drop a page that has users. Really. I'm telling you that cases #4 and #5 are _much_ worse than your "solution" to case #2. And you argue that your solution is good only because you're completely ignoring cases #4 and #5. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 20 Oct 2000, Trond Myklebust wrote: > > > invalidate_inode_pages nfs_zap_caches nfs_lock fcntl_setlk > > do_fcntl sys_fcntl > > > So I guess that NFS locking is really bad if the region is > > mmapped! > > Yep, but that's a symptom, not a cause. We want to be able to run > invalidate_inode_pages() safely at any moment, since the need can be > triggered externally (because the server and client page caches > disagree). Well, the thing is, that if somebody has a page mapped, there's nothing we can do if the server and client disagrees. We just cannot invalidate the page - people would actually lose data, and would lose major local consistency guarantees that UNIX filesystems are supposed to get. So it's a matter of one sort of consistency against another - and I think the local consistency requirements are the stricter ones, considering that we don't even _know_ whether the server has really changed those pages or not.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 20 Oct 2000, Trond Myklebust wrote: > > Under NFS the problem is that pages can (and *should*) be invalidated > despite there being pending write backs. The server can trigger the > need for a cache invalidation at any time. > The existence of file locks that aren't page aligned, as well as > partial page writeback ensure that we cannot make the equivalence >page has pending write == page is correct. Note that this is not what my patch really says at all. My patch says "ok, we should invalidate this page, because somebody thinks it may be bad. HOWEVER, we cannot reasonably do that, because we have users of the page (which may be completely unrelated to pending writes), and invalidating this page is _certain_ to cause corruption". Basically think of the case of somebody having a shared mmap over NFS. He has dirty data in his pages, and he expects those to be written out to the server. There is NO QUESTION about this fact. Imagine somebody (possibly even the same person) then releasing or getting a file lock, possibly in some other part of the file. Without that added test, all those dirty local pages just got completely thrown away. Now, you have a choice: you can KNOW that you're doing something horribly and utterly wrong (throwing away data), or you can SUSPECT that you might cause non-coherency between the client and the server. You cannot get both. I'm saying that we're much much better off guaranteeing local consistency over knowingly breaking local consistency over a uncertain global consistency issue. Especially as NFS has never guaranteed global consistency in the first place, and was not designed to do that. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Linus Torvalds <[EMAIL PROTECTED]> writes: > The advantage of clearing the uptodate flag (as opposed to > doing what we do now - dropping the page altogether) is that > there would be no cache aliasing issues, and there would be no > issues with a page and its associated data just "disappearing" > from under somebody. It would cause the page to be read in > again the next time it is faulted in or somebody does a read() > on it, and that's exactly what we want. I've proposed this in the past, but there you refused on the grounds that it breaks assumptions in the VFS. I'm otherwise happy with this sort of thing. It ensures both shared mmap() and cache consistency. The only problem I can think of would be for dirty pages that haven't yet been msync()ed, however I'm far from being competent to evaluate any side-effects on the VM. > However, we _do_ need to WB the page data some way - but the > decision on whether to invalidate write-backs or finish them > would have to be up to the low-level filesystem. This is already in place. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Linus Torvalds <[EMAIL PROTECTED]> writes: > Btw, that "invalidate_inode_pages()" thing is just wrong - we > can't just remove pages that are mapped etc, because that would > result in no end of fun aliasing problems etc. > How about adding a test in invalidate_inode_pages() like > /* We cannot invalidate a locked page */ if > (TryLockPage(page)) > continue; > + /* We cannot invalidate a page that is in use */ > + if (page_count(page) != 1) { > + UnlockPage(page); > + continue; > + } > + > __lru_cache_del(page); __remove_inode_page(page); The problem here is that NFS pages have 3 rather than 2 states: 1) mmapped & correct. 2) mmapped & incorrect. (but possibly dirty) 3) Unmapped For case 1), we clearly want to have the page in inode->i_mapping. For cases 2) & 3) we don't. However for case 2) we still have a weak association to the inode itself, and we want to be able to reference inode metadata etc. Would it make sense then to remove these pages from i_mapping, but to hang them onto a new struct address_space (call it i_unmapped for want of a better name)? That would allow you to keep a consistent state for the page, while still allowing you to 'invalidate' it (by removing it from the i_mapping) and hence maintain a consistent cache. invalidate_inode_pages() would then reduce to remove_page_from_inode_queue(page); remove_page_from_hash_queue(page); if (page_count(page)) add_page_to_inode_unmapped(page); Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Alexander Viro writes: > Trond, I'm not asking about implementation - the question being what > semantics do you want for nfs_zap_caches() wrt user-mapped pages. Ok, looking through sendmail, and then db2, the situation is created by the db2 library. If the process does the following: 1. creates NFS file 2. sets file size by lseek + write 3. maps it 4. fcntl locks the file (writes above data back to file) 5. write to the mapping (makes pages dirty) 6. fcntl unlocks it (dirty page data is NOT written back) 7. closes it 8. unlinks it 9. some time later unmaps it(causes dirty data to be written back) Note that (6) doesn't act as a barrier to synchronise writes in this case, but it does for any normal write()s. Surely NFS should cause any dirty data associated with the file to be written back to the server no matter what? Although Linus' fix seems to prevent the problem, I get the feeling that it is a sticky plaster over a much bigger problem. _ |_| - ---+---+- | | Russell King[EMAIL PROTECTED] --- --- | | | | http://www.arm.linux.org.uk/personal/aboutme.html / / | | +-+-+ --- -+- / | THE developer of ARM Linux |+| /|\ / | | | --- | +-+-+ - /\\\ | - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 19 Oct 00 at 16:32, Linus Torvalds wrote: > How about adding a test in invalidate_inode_pages() like > > /* We cannot invalidate a locked page */ > if (TryLockPage(page)) > continue; > > + /* We cannot invalidate a page that is in use */ > + if (page_count(page) != 1) { > + UnlockPage(page); > + continue; > + } > + > __lru_cache_del(page); > __remove_inode_page(page); Hi Linus, this did not fix problem with my testcase - mmap(shared),fork,dirty_it,ftruncate,finish - does NOT go through this code (invalidate_inode_pages) at all (should it?). I instrumented both swapout in page_io, and __remove_inode_page, and pages are going through __remove_inode_page with strange use count. I do not know whether it is good or no... During normal life of system, pages with count==2 && index==0 are passed to __remove_inode_page. As soon as I start my test, non-zero index pages with count > 2 are passed here. Later this pages appear in filemap_sync... Thanks, Petr Vandrovec [EMAIL PROTECTED] vana:~# free total used free sharedbuffers cached Mem:255768 203232 52536 0 4084 35496 -/+ buffers/cache: 163652 92116 Swap: 530136 0 530136 vana:~# free total used free sharedbuffers cached Mem:255768 203376 52392 0 4084 35636 -/+ buffers/cache: 163656 92112 Swap: 530136 0 530136 vana:~# free total used free sharedbuffers cached Mem:255768 72172 183596 0 4084 32888 -/+ buffers/cache: 35200 220568 Swap: 530136 0 530136 vana:~# free total used free sharedbuffers cached Mem:255768 72180 183588 0 4084 32888 -/+ buffers/cache: 35208 220560 Swap: 530136 0 530136 11:57:44 Bad page count in __remove_inode_page! 11:57:44 page: c1349ef8 11:57:44 mapping: cc65c6dc 11:57:44 index:0 11:57:44 nexthash: 11:57:44 count:2 11:57:44 flags:0x0009 11:57:44 age: 0 11:57:44 pprevhash: 11:57:44 buffers: 11:57:44 virtual: cc61a000 11:57:44 zone: c021f578 11:57:44 pagedump done 11:57:44 Bad page count in __remove_inode_page! 11:57:44 page: c1120394 11:57:44 mapping: cc6b2edc 11:57:44 index:32767 11:57:44 nexthash: 11:57:44 count:3 11:57:44 flags:0x0009 11:57:44 age: 5 11:57:44 pprevhash: 11:57:44 buffers: 11:57:44 virtual: c43d1000 11:57:44 zone: c021f578 11:57:44 pagedump done 11:57:46 Bad page count in __remove_inode_page! 11:57:46 page: c1266298 11:57:46 mapping: cc6b2edc 11:57:46 index:13635 11:57:46 nexthash: 11:57:46 count:3 11:57:46 flags:0x0009 11:57:46 age: 5 11:57:46 pprevhash: 11:57:46 buffers: 11:57:46 virtual: c9082000 11:57:46 zone: c021f578 11:57:46 pagedump done 11:57:46 page->mapping == NULL 11:57:46 error: -22 11:57:46 ptep: c92f6a2c 11:57:46 pteval: 09062027 11:57:46 vma:ce926420 11:57:46 vm_mm:ccd67d60 11:57:46 vm_start: 40128000 11:57:46 vm_end: 48128000 11:57:46 vm_next: c1490b20 11:57:46 vm_avl_height: 1 11:57:46 vm_avl_left: 11:57:46 vm_avl_right: 11:57:46 vm_next_share: 11:57:46 vm_pprev_share: ccdee684 11:57:46 vm_operations_struct: c021f1a0 11:57:46 vm_pgoff: 11:57:46 vm_file: cd4270e0 11:57:46 vm_raend: 11:57:46 vm_private_data: 11:57:46 address:4368B000 11:57:46 flags: 0001 11:57:46 file: cd4270e0 11:57:46 dentry: cc8ae440 11:57:46 inode: cc6b2e40 11:57:46 num: 848677 11:57:46 dev: 0x0302 11:57:46 path: /usr/src/tst/ram0 (deleted) 11:57:46 vfsmount: c14ca8c0 11:57:46 op: c0221bc0 11:57:46 count:4 11:57:46 flags:0x0002 11:57:46 mode: 03 11:57:46 pos: 0 11:57:46 reada:0 11:57:46 ramax:0 11:57:46 raend:0 11:57:46 ralen:0 11:57:46 rawin:0 11:57:46 owner.pid: 0 11:57:46 owner.uid: 0 11:57:46 owner.euid: 0 11:57:46 owner.signum: 0 11:57:46 uid: 0 11:57:46 gid: 0 11:57:46 error:0 11:57:46 version: 11641 11:57:46 private_data: 11:57:46 page: c1265a18 11:57:46 mapping: 11:57:46 index:
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Alexander Viro <[EMAIL PROTECTED]> writes: > So what exactly do you want it to do when page is mapped by > user process? Should it remain visible or not? What should > happen if process writes to that page? > Trond, I'm not asking about implementation - the question being > what semantics do you want for nfs_zap_caches() wrt user-mapped > pages. For the general case of the page cache I think we can keep them quite simple: + We do in any case want to drop all pages that are unreferenced. (The reason for flushing may be that the file size has changed.) + For pages that are referenced (and unlocked) we would like to force them to get read in anew ASAP. How this is done in practice is irrelevant as far as NFS is concerned provided that we don't sleep on any I/O while in nfs_zap_caches()/invalidate_inode_pages(). The lower level stuff can and will sort out the business of flushing out pending writebacks that conflict with the read, so that isn't a problem for the VFS/VM. The problem lies with writes that haven't yet been msync()ed (and hence do not have writebacks). For shared mappings, one should perhaps schedule an automatic msync() of the dirty pages (???). For private mappings, perhaps the best thing would be to defer the read? Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Fri, 20 Oct 2000, Trond Myklebust wrote: > > The problem lies with writes that haven't yet been msync()ed (and > hence do not have writebacks). For shared mappings, one should perhaps > schedule an automatic msync() of the dirty pages (???). For private > mappings, perhaps the best thing would be to defer the read? Note that NONE of this is going to happen for 2.4.x. We've never _ever_ done this before, there's no point in even suggesting that this is suddenly a "critical" bug. It's not. I want to know what the suggestion for 2.4.x is. Right now that's the "if the count is elevated, we don't invalidate". Quite frankly, I don't see any other option. Doing the !Uptodate version will lose local data as it stands now - in fact right now you'd lose data that way even if you are the only client accessing the file, which is obviously complete crap and _completely_ unacceptable. I'm open to suggestions, but I haven't heard anything realistic. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Linus Torvalds <[EMAIL PROTECTED]> writes: > How about adding a test in invalidate_inode_pages() like > /* We cannot invalidate a locked page */ if > (TryLockPage(page)) > continue; > + /* We cannot invalidate a page that is in use */ > + if (page_count(page) != 1) { > + UnlockPage(page); > + continue; > + } > + > __lru_cache_del(page); __remove_inode_page(page); > Because otherwise we might end up invalidating pages that might > have pending write-back etc (although I think the NFS logic > tries to avoid invalidating when there's pending activity). You may remember that I've been fighting a rearguard action against this for some time. Under NFS the problem is that pages can (and *should*) be invalidated despite there being pending write backs. The server can trigger the need for a cache invalidation at any time. The existence of file locks that aren't page aligned, as well as partial page writeback ensure that we cannot make the equivalence page has pending write == page is correct. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Russell King <[EMAIL PROTECTED]> writes: > invalidate_inode_pages nfs_zap_caches nfs_lock fcntl_setlk > do_fcntl sys_fcntl > So I guess that NFS locking is really bad if the region is > mmapped! Yep, but that's a symptom, not a cause. We want to be able to run invalidate_inode_pages() safely at any moment, since the need can be triggered externally (because the server and client page caches disagree). Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Russell King <[EMAIL PROTECTED]> writes: > Indeed. page->mapping is set to NULL in two places, one in > __remove_inode_pages() and the other one in the swap code after > we've checked that it was NULL. I hadn't found the particular > call trace that caused us to ended up in __remove_inode_page() > though with this page. Have you tried tagging invalidate_inode_pages() with a debugging statement? Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == David S Miller <[EMAIL PROTECTED]> writes: > Actually, judging by the trace you provided Russell, I'd say > this is some peculiarity with NFS silly rename handling, and > it'd be best to look for the bug in that code (early inode > reference loss, for example?) Russel's trace indicates that the unlink actually has completed and has become a negative dentry since the file is labelled '(deleted)'. That means that the dentry count must have been zero, so that dentry_iput() was called. I don't see how dentry_iput() can be called on an open file. In principle the dentry count should always be >= 1, so unless there is some place where we're calling d_delete() without get()ing the dentry first, there should be no path for early inode loss. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Linus Torvalds writes: > Can you get a full stack trace? filemap_write_page filemap_sync filemap_unmap do_munmap sys_munmap > How about adding a test in invalidate_inode_pages() like (Added, along with a call to drop a stack trace out). Yes, this does stop the problem in filemap_write_page. The stack trace here shows: invalidate_inode_pages nfs_zap_caches nfs_lock fcntl_setlk do_fcntl sys_fcntl So I guess that NFS locking is really bad if the region is mmapped! _ |_| - ---+---+- | | Russell King[EMAIL PROTECTED] --- --- | | | | http://www.arm.linux.org.uk/personal/aboutme.html / / | | +-+-+ --- -+- / | THE developer of ARM Linux |+| /|\ / | | | --- | +-+-+ - /\\\ | - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
In article <8snvuj$1l0$[EMAIL PROTECTED]>, Linus Torvalds <[EMAIL PROTECTED]> wrote: > >Hmm.. Looks like page->mapping was cleared by truncate_inode_pages() >when the inode was free'd, and there was still write-back activity on >one of the pages in question. Looking some more, the fact that the oops happens in filemap_sync_pte() definitely means that we haven't gotten around to truncating the sucker yet. So invalidate_inode_pages() looks like the more likely cause, especially as the page writeback correctly increases the file count for NFS, so we should never be able to get into the horror-schenario of having write-back after the pages have been truncated due to the inode getting cleared. Does the simple added test for page counts to invalidate_inode_pages() fix the problem for you, Russell? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
In article <[EMAIL PROTECTED]>, Trond Myklebust <[EMAIL PROTECTED]> wrote: >> " " == Petr Vandrovec <[EMAIL PROTECTED]> writes: > > > You do not have to use NFS - look for my postings with > > 'page->mapping == NULL' in archive. Your code uses shared mmap, > > 'page->isn't > > it? Probably shared by couple of processes. > >It's probably particularly nasty under NFS because of >invalidate_inode_pages(). Btw, that "invalidate_inode_pages()" thing is just wrong - we can't just remove pages that are mapped etc, because that would result in no end of fun aliasing problems etc. I don't see why invalidate_inode_pages() would be called for this sequence, though, which is why I'd be more likely to blame the truncate case, but I think you're right about invalidate_inode_pages() being a problem none-the-less. How about adding a test in invalidate_inode_pages() like /* We cannot invalidate a locked page */ if (TryLockPage(page)) continue; + /* We cannot invalidate a page that is in use */ + if (page_count(page) != 1) { + UnlockPage(page); + continue; + } + __lru_cache_del(page); __remove_inode_page(page); Because otherwise we might end up invalidating pages that might have pending write-back etc (although I think the NFS logic tries to avoid invalidating when there's pending activity). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
In article <[EMAIL PROTECTED]>, Russell King <[EMAIL PROTECTED]> wrote: >Petr Vandrovec writes: >> ... or from sys_exit() if you forget to unmap. Or from anywhere if >> swapping code decides to swap such page. I'm trying to hunt it down >> for more than month, but I have no idea what's wrong. In my case >> way to trigger bug is: > >I actually think its as simple as: > >1. shared map a file >2. close file >3. unlink file >4. unmap shared mapping > >In my case, I was running newaliases, and there wasn't any other >processes using that mapping. Hmm.. Looks like page->mapping was cleared by truncate_inode_pages() when the inode was free'd, and there was still write-back activity on one of the pages in question. Looking at the problem, it appears that the unmap() will (correctly) do a writepage() for each dirty page, and that will (again correctly) cause NFS to do a delayed write-out with all the proper things (increment page count because of the pending write etc). The inode hasn't been free'd yet, because we're still using it. But then when we _do_ release the inode, we end up doing truncate_inode_pages() and through that we clear page->mapping, so the pending write-out is SOL. I have this feeling that we probably should NOT remove the mapping until the page count has gone down to zero. The other alternative would be to flush the pending write-outs (or wait for them synchronously - but that's just stupid and inefficient), which would imply that we'd better add the "flush" operation to "struct address_space_operations". Can you get a full stack trace? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
From: Russell King <[EMAIL PROTECTED]> Date:Fri, 20 Oct 2000 00:07:55 +0100 (BST) Trond Myklebust writes: > It's probably particularly nasty under NFS because of > invalidate_inode_pages(). The latter empties the page cache whenever > we can no longer trust it and calls remove_inode_page() on every > unlocked page. It won't care whether the page is mmapped or not. > > My guess is therefore that the line setting 'page->mapping = NULL' in > __remove_inode_page() is a candidate for scrutiny... Indeed. page->mapping is set to NULL in two places, one in __remove_inode_pages() and the other one in the swap code after we've checked that it was NULL. I hadn't found the particular call trace that caused us to ended up in __remove_inode_page() though with this page. Actually, judging by the trace you provided Russell, I'd say this is some peculiarity with NFS silly rename handling, and it'd be best to look for the bug in that code (early inode reference loss, for example?) Later, David S. Miller [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Trond Myklebust writes: > It's probably particularly nasty under NFS because of > invalidate_inode_pages(). The latter empties the page cache whenever > we can no longer trust it and calls remove_inode_page() on every > unlocked page. It won't care whether the page is mmapped or not. > > My guess is therefore that the line setting 'page->mapping = NULL' in > __remove_inode_page() is a candidate for scrutiny... Indeed. page->mapping is set to NULL in two places, one in __remove_inode_pages() and the other one in the swap code after we've checked that it was NULL. I hadn't found the particular call trace that caused us to ended up in __remove_inode_page() though with this page. _ |_| - ---+---+- | | Russell King[EMAIL PROTECTED] --- --- | | | | http://www.arm.linux.org.uk/personal/aboutme.html / / | | +-+-+ --- -+- / | THE developer of ARM Linux |+| /|\ / | | | --- | +-+-+ - /\\\ | - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
> " " == Petr Vandrovec <[EMAIL PROTECTED]> writes: > You do not have to use NFS - look for my postings with > 'page->mapping == NULL' in archive. Your code uses shared mmap, > 'page->isn't > it? Probably shared by couple of processes. It's probably particularly nasty under NFS because of invalidate_inode_pages(). The latter empties the page cache whenever we can no longer trust it and calls remove_inode_page() on every unlocked page. It won't care whether the page is mmapped or not. My guess is therefore that the line setting 'page->mapping = NULL' in __remove_inode_page() is a candidate for scrutiny... Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Roger Larsson writes: > Will it work correctly if 4. is done before 3. (even before 2?) > Is it legal/good practice to unmap the file after closing it? > (Since the sharing needs the fd to mmap it) Dunno, haven't tried that yet. I'll have a go tomorrow, but I think it'll work correctly. I'll also dig about in the VFS / NFS code some more to see if I can find anything obvious. > I would have expected this order: > A. open file > B. shared map file > C. unmap shared mapping > D. close file > E. unlink file BTW, the program which is giving grief is good ol' sendmail, v8.11. It does the unmap after close all over the place. _ |_| - ---+---+- | | Russell King[EMAIL PROTECTED] --- --- | | | | http://www.arm.linux.org.uk/personal/aboutme.html / / | | +-+-+ --- -+- / | THE developer of ARM Linux |+| /|\ / | | | --- | +-+-+ - /\\\ | - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Russell King wrote: > > Petr Vandrovec writes: > > ... or from sys_exit() if you forget to unmap. Or from anywhere if > > swapping code decides to swap such page. I'm trying to hunt it down > > for more than month, but I have no idea what's wrong. In my case > > way to trigger bug is: > > I actually think its as simple as: > 0. open file > 1. shared map file > 2. close file > 3. unlink file > 4. unmap shared mapping Will it work correctly if 4. is done before 3. (even before 2?) Is it legal/good practice to unmap the file after closing it? (Since the sharing needs the fd to mmap it) I would have expected this order: A. open file B. shared map file C. unmap shared mapping D. close file E. unlink file But in your case the unlink should be deferred to unmap time... (if it is legal) Successful unlinking a file should probably free pages directly to free list - might be worth optimizing for. /RogerL Disclaimer: I do not really understand fs code... > > In my case, I was running newaliases, and there wasn't any other > processes using that mapping. >_ > |_| - ---+---+- > | | Russell King[EMAIL PROTECTED] --- --- > | | | | http://www.arm.linux.org.uk/personal/aboutme.html / / | > | +-+-+ --- -+- > / | THE developer of ARM Linux |+| /|\ > / | | | --- | > +-+-+ - /\\\ | > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ -- Home page: http://www.norran.net/nra02596/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Petr Vandrovec writes: > ... or from sys_exit() if you forget to unmap. Or from anywhere if > swapping code decides to swap such page. I'm trying to hunt it down > for more than month, but I have no idea what's wrong. In my case > way to trigger bug is: I actually think its as simple as: 1. shared map a file 2. close file 3. unlink file 4. unmap shared mapping In my case, I was running newaliases, and there wasn't any other processes using that mapping. _ |_| - ---+---+- | | Russell King[EMAIL PROTECTED] --- --- | | | | http://www.arm.linux.org.uk/personal/aboutme.html / / | | +-+-+ --- -+- / | THE developer of ARM Linux |+| /|\ / | | | --- | +-+-+ - /\\\ | - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 19 Oct 00 at 22:16, Russell King wrote: > I'm seeing an oops caused by a NULL pointer dereference in mm/filemap.c, > filemap_write_page. The NULL pointer in question is page->mapping. > The box on which this is happening is using a root NFS filesystem (in > fact all but one of its filesystems are NFS). You do not have to use NFS - look for my postings with 'page->mapping == NULL' in archive. Your code uses shared mmap, isn't it? Probably shared by couple of processes. > So the file that was mapped has been deleted... Ok, > With the following in filemap_sync_pte: > > this reveals: > > --> page c017e118 mapping NULL - address 40264000 > --> vma->vm_file = c0c8b120 > --> /var/tmp/.nfs16130002 (deleted) > > And the stack trace indicates that the kernel execution path came from > sys_munmap. ... or from sys_exit() if you forget to unmap. Or from anywhere if swapping code decides to swap such page. I'm trying to hunt it down for more than month, but I have no idea what's wrong. In my case way to trigger bug is: (1) shared map some page (2) unlink underlying file (3) fork (4) from child dirty pages (5) from parent truncate file (6) exit both tasks (7) oops Just in case you'll not find oops demo code in archive... Last tested under 2.4.0-test10-pre3. With -pre4 I cannot decide whether to comment out BUG() in vmscan, or not, so I did not test -pre4 yet... Maybe you can simplify code even more. During tests (dual PIII/450, 256MB RAM) system is completely dead for up to 40 secs, so do not panic. It is not needed to hit reset button, just wait ;-) Any idea welcomed. Best regards, Petr Vandrovec [EMAIL PROTECTED] #include #include #include #include #include #include #include #include #include #include unsigned char zero; #define MSIZE 0x0E00 void x4768(void) { int fd; pid_t pid; int x4778_1[2]; int x4778_2[2]; int x4779_1[2]; int x4779_2[2]; int from4778; int to4778; int from4779; int to4779; char* mma[65536]; unsigned int mml[65536]; unsigned int mmi = 0; unsigned int ln = 0; unsigned int x; #define MAL2(a,l) ftruncate(fd, ln+l); mml[mmi] = l; mma[mmi++] = mmap(a, l, PROT_READ|PROT_WRITE, MAP_SHARED, fd, ln); ln += l; mma[mmi-1][0] = 99; #define MAL(l) MAL2(NULL,l) fd = open("ram0", O_RDWR | O_CREAT, 0600); unlink("ram0"); MAL(MSIZE); pipe(x4778_1); pipe(x4778_2); pid = fork(); if (!pid) { int from4768 = x4778_1[0]; int to4768 = x4778_2[1]; close(x4778_1[1]); close(x4778_2[0]); close(fd); read(from4768, &zero, 1); for (x = 0; x < mml[mmi - 1]; x += 4096) mma[mmi-1][x] = 98; write(to4768, &zero, 1); read(from4768, &zero, 1); exit(0); } else if (pid < 0) { perror("fork failed"); exit(222); } to4778 = x4778_1[1]; from4778 = x4778_2[0]; close(x4778_1[0]); close(x4778_2[1]); /* so that read/write to this fails... */ from4779 = -1; to4779 = -1; { write(to4778, &zero, 1); write(to4779, &zero, 1); read(from4778, &zero, 1); read(from4779, &zero, 1); write(to4778, &zero, 1); write(to4779, &zero, 1); sleep(5); ftruncate(fd, 0); } close(fd); while (mmi--) { munmap(mma[mmi], mml[mmi]); } exit(0); } int main(int argc, char* argv[]) { printf("Go\n"); mlockall(MCL_CURRENT); x4768(); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/