On Thu, 26 May 2011 14:38:59 +0100 Simon Wilkinson <[email protected]> wrote:
> When we write_begin on ext3 we start a journal, which isn't completed
> until write_end is called. So, if we page fault whilst we are copying
> between userspace and kernel, we will re-enter the journal, and see
> the assert you see. However, the kernel should prevent this page fault
> from ever occurring, as it can cause deadlocks (the page fault may
> result in memory pressure which causes pages to be flushed, but you're
> already in a filesystem, and you then deadlock). So, write() ensures
> that all user pages required for the copy are in memory before calling
> write_begin, and then actually disables pagefaults during the duration
> of the copy.
>
> I suspect that the reason why you can't reproduce this on your test
> system, but are seeing it in the wild, is that 2.6.9 has some, but not
> all, of this logic, and so when testing you're seeing pagefaults
> occurring before begin_write (prepare_write, on something that old) is
> called, but on the "real" system, memory pressure is causing a race
> whereby a page that has been swapped in is being swapped out again
> before it can be used.

Okay yeah, that makes sense. The vanilla 2.6.9 has basically:

    fault_in_pages_readable(buf, bytes);
    page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
    status = a_ops->prepare_write(file, page, offset, offset+bytes);
    filemap_copy_from_user(page, offset, buf, bytes);

I was looking at a backtrace in my tests, but it's not always easy to
see exactly where I'm being called from, since that just records the
last N functions you've been in, if I understand it correctly. (I trust
the panic traces more, though.)

So (in a hypothetical panic in vanilla 2.6.9) the page fault presumably
happens before prepare_write, but the pages are evicted from memory
again before filemap_copy_from_user is called. It doesn't seem like it's
possible for this to be our fault, then, unless we somehow screwed up
something during the pre-prepare_write fault?
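
To make the window concrete, here is a toy userspace model of that
sequence. It is only a sketch and not kernel code: every toy_* name and
struct user_page are made up for illustration, and a return of 0 from
the copy stands for "the copy would have needed a page fault between
prepare_write and commit_write". The retry variant is an assumption
about how a faults-disabled copy can still make progress, not something
taken from the 2.6.9 source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one page of the user's write() buffer. */
struct user_page {
    bool resident;   /* true if the page is currently in memory */
    char data[16];
};

/* Models the pre-fault step: faulting is safe here, so the page comes in. */
void toy_fault_in(struct user_page *up) { up->resident = true; }

/* Models memory pressure pushing the page back out. */
void toy_evict(struct user_page *up) { up->resident = false; }

/* Models a copy with page faults disabled: it never faults, it just
 * reports 0 bytes copied when the source page is not resident. */
size_t toy_copy_atomic(char *dst, const struct user_page *up, size_t n)
{
    assert(n <= sizeof(up->data));
    if (!up->resident)
        return 0;
    memcpy(dst, up->data, n);
    return n;
}

/* One-shot sequence, shaped like the four calls quoted above.
 * evict_between simulates losing the race with memory pressure. */
size_t toy_write_once(char *dst, struct user_page *up, size_t n,
                      bool evict_between)
{
    toy_fault_in(up);              /* fault_in_pages_readable() */
    /* prepare_write() would open the journal handle here */
    if (evict_between)
        toy_evict(up);             /* page swapped out again */
    return toy_copy_atomic(dst, up, n);
}

/* Retry variant (assumption): on a short copy, give up this attempt
 * and fault the page in again before trying the copy once more.
 * evictions is how many times the race is lost before it succeeds. */
size_t toy_write_retry(char *dst, struct user_page *up, size_t n,
                       int evictions)
{
    for (;;) {
        toy_fault_in(up);
        if (evictions > 0) {
            toy_evict(up);
            evictions--;
        }
        size_t copied = toy_copy_atomic(dst, up, n);
        if (copied == n)
            return copied;
        /* short copy: loop, so the next fault-in happens outside
         * the (modelled) journal window */
    }
}
```

With evict_between set, toy_write_once() comes back with 0 bytes, which
is the case where the real copy would have to fault inside the journal.
toy_write_retry() just goes around again until the page stays resident
for the duration of the copy.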

It's not immediately clear to me how even modern Linux handles this,
though. Say, for example, a callback break comes in between those calls
and invalidates the pages; would the call to truncate_inode_pages or
whatever block until the write finishes (from some Linux lock), or would
filemap_copy_from_user (or whatever the modern analogue is) return an
error that causes the operation to be retried or something?

-- 
Andrew Deason
[email protected]
