Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
Erez Zadok wrote:
> Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree.
> Kernel w/ SMP, preemption, and lockdep configured.

This is a real lock ordering problem.  Thanks for reporting it.

The updating of atime inside sys_mmap() orders the mmap_sem in the vfs
outside of the journal handle in ext3's inode dirtying:

> -> #1 (jbd_handle){--..}:
>        [] __lock_acquire+0x9cc/0xb95
>        [] lock_acquire+0x5f/0x78
>        [] journal_start+0xee/0xf8
>        [] ext3_journal_start_sb+0x48/0x4a
>        [] ext3_dirty_inode+0x27/0x6c
>        [] __mark_inode_dirty+0x29/0x144
>        [] touch_atime+0xb7/0xbc
>        [] generic_file_mmap+0x2d/0x42
>        [] mmap_region+0x1e6/0x3b4
>        [] do_mmap_pgoff+0x1fb/0x253
>        [] sys_mmap2+0x9b/0xb5
>        [] syscall_call+0x7/0xb
>        [] 0x

ext3_direct_IO() orders the journal handle outside of the mmap_sem that
dio_get_page() acquires to pin pages with get_user_pages():

> -> #0 (&mm->mmap_sem){}:
>        [] __lock_acquire+0x8bc/0xb95
>        [] lock_acquire+0x5f/0x78
>        [] down_read+0x3a/0x4c
>        [] dio_get_page+0x4e/0x15d
>        [] __blockdev_direct_IO+0x431/0xa81
>        [] ext3_direct_IO+0x10c/0x1a1
>        [] generic_file_direct_IO+0x124/0x139
>        [] generic_file_direct_write+0x56/0x11c
>        [] __generic_file_aio_write_nolock+0x33d/0x489
>        [] generic_file_aio_write+0x58/0xb6
>        [] ext3_file_write+0x27/0x99
>        [] do_sync_write+0xc5/0x102
>        [] vfs_write+0x90/0x119
>        [] sys_write+0x3d/0x61
>        [] sysenter_past_esp+0x5f/0xa5
>        [] 0x

Two fixes come to mind:

1) use something like Peter's ->mmap_prepare() to update atime before
acquiring the mmap_sem ( http://lkml.org/lkml/2007/11/11/97 ).  I don't
know if this would leave more paths which do a journal_start() while
holding the mmap_sem.

2) rework ext3's dio to only hold the jbd handle in ext3_get_block().
Chris has a patch for this kicking around somewhere but I'm told it has
problems exposing old blocks in ordered data mode.

Does anyone have preferences?  I could go either way.  I certainly
don't like the idea of journal handles being held across the entirety
of fs/direct-io.c.
It's yet another case of O_DIRECT differing wildly from the buffered
path :(.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
> If anyone has a testcase - I can take a look at the problem again.

I can try and throw something together..

- z
Re: [PATCH] fix invalidate_inode_pages2_range not to clear ret
Hisashi Hifumi wrote:
> Hi.
>
> DIO invalidates page cache through invalidate_inode_pages2_range().
> invalidate_inode_pages2_range() sets ret=-EIO when
> invalidate_complete_page2() fails, but this ret is cleared if
> do_launder_page() succeed on a page of next index.
> In this case, dio is carried out even if invalidate_complete_page2()
> fails on some pages.
> This can cause inconsistency between memory and blocks on HDD because
> the page cache still exists.
>
> Following patch fixes this issue.

I like the idea of fixing this in a separate patch, yes, thanks!  But...

> Thanks.
>
> Signed-off-by: Hisashi Hifumi <[EMAIL PROTECTED]>
>
> diff -Nrup linux-2.6.24-rc5.org/mm/truncate.c linux-2.6.24-rc5/mm/truncate.c
> --- linux-2.6.24-rc5.org/mm/truncate.c	2007-12-12 16:32:45.0 +0900
> +++ linux-2.6.24-rc5/mm/truncate.c	2007-12-13 11:45:29.0 +0900
> @@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct
>  	pgoff_t next;
>  	int i;
>  	int ret = 0;
> +	int ret2 = 0;
>  	int did_range_unmap = 0;
>  	int wrapped = 0;
>
> @@ -441,13 +442,13 @@ int invalidate_inode_pages2_range(struct
>  			BUG_ON(page_mapped(page));
>  			ret = do_launder_page(mapping, page);
>  			if (ret == 0 && !invalidate_complete_page2(mapping, page))
> -				ret = -EIO;
> +				ret2 = -EIO;
>  			unlock_page(page);
>  		}
>  		pagevec_release(&pvec);
>  		cond_resched();
>  	}
> -	return ret;
> +	return !ret ? ret2 : ret;
>  }
>  EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);

... this doesn't work.  Notice that it only propagates the -EIO into
ret2?  It can lose errors from do_launder_page() itself because they
aren't stored into ret2, which is what's returned if the last
do_launder_page() succeeds.

This isn't what I meant when I mentioned that 'ret2' pattern.  The idea
is to store the errors from inner calls into some private var and then
only promote those errors to the function's return code (ret) if there
was an error and ret wasn't already set.  We do this in a few places in
dio, you'll notice.
This patch gets the meaning of 'ret' and 'ret2' opposite from those
uses, which is kind of confusing.

So, uh, how about the following.  Totally untested, but compiled.

---

invalidate_inode_pages2_range(): consistently return first error

Hisashi Hifumi noticed that we were losing errors in
invalidate_inode_pages2_range().  Later do_launder_page() calls could
overwrite errors generated by earlier calls.

Fix this by storing do_launder_page()'s return code in a temporary
variable which is only promoted to the function's return code if the
function hadn't already generated an error.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Cc: Hisashi Hifumi <[EMAIL PROTECTED]>

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..24578b3 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	pgoff_t next;
 	int i;
 	int ret = 0;
+	int rc;
 	int did_range_unmap = 0;
 	int wrapped = 0;

@@ -439,9 +440,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				}
 			}
 			BUG_ON(page_mapped(page));
-			ret = do_launder_page(mapping, page);
-			if (ret == 0 && !invalidate_complete_page2(mapping, page))
-				ret = -EIO;
+			rc = do_launder_page(mapping, page);
+			if (rc == 0 && !invalidate_complete_page2(mapping, page))
+				rc = -EIO;
+			if (rc && ret == 0)
+				ret = rc;
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
Hisashi Hifumi wrote:
> Hi.
>
> Current dio has some problems:
> 1, In ext3 ordered, dio write can return with EIO because of the race
> between invalidation of a page and jbd.  jbd pins the bhs while
> committing journal so try_to_release_page fails when jbd is committing
> the transaction.

Yeah.  It sure would be fantastic if some ext3 expert could stop this
from happening somehow.  But that hasn't happened in.. uh.. Badari, for
how many years has this been on the radar? :)

> Past discussion about this issue is as follows.
> http://marc.info/?t=11934343124&r=1&w=2
> http://marc.info/?t=11265676282&r=1&w=2
>
> 2, invalidate_inode_pages2_range() sets ret=-EIO when
> invalidate_complete_page2() fails, but this ret is cleared if
> do_launder_page() succeed on a page of next index.

Oops.  That's too bad.  So maybe we should fix it by not stomping on
that return code?

	ret2 = do_launder()
	if (ret2 == 0)
		ret2 = invalidate()
	if (ret == 0)
		ret = ret2

I'd be surprised if we ever wanted to mask an -EIO when later pages
laundered successfully.

> In this case, dio is carried out even if invalidate_complete_page2()
> fails on some pages.
> This can cause inconsistency between memory and blocks on HDD because
> the page cache still exists.

Yeah.

> I solved problems above by introducing invalidate_inode_pages3_range()
> and falling through to buffered I/O when invalidation of a page failed.

Well, I like the idea of more intelligently dealing with the known
problem between dio and ext3.

I'm not sure that falling back to buffered is right.  If dio can tell
that it only failed because jbd held the buffer, should we have waited
for the transaction to complete before trying to invalidate again?  If
we could do that, we'd avoid performing the IO twice.

> We can distinguish between failure of page invalidation and other errors
> with the return value of invalidate_inode_pages3_range().

I'm not sure duplicating the invalidation loop into a new function is
the right thing.
Maybe we'd just tweak inode_pages2 to indicate to the caller the
specific failing circumstances somehow.  Maybe.

- z
Re: [PATCH]loop cleanup in fs/namespace.c - repost
>> This doesn't look fine.  Did you test this?
>
> Oops, my fault.  Of course, I tested the patch, but kernel modules are
> disabled in my test setup, so I missed the error. :)
> Enclosed to this message is a new patch, which replaces the goto-loop by
> the while-based one, but leaves the EXPORT_SYMBOL macro intact.

It certainly looks OK to me now, for whatever that's worth.

You probably want to wait till the next merge window to get it in,
though.  It's just a cleanup and so shouldn't go in this late in the -rc
line.  Maybe Andrew will be willing to queue it until that time in -mm.

- z
Re: [PATCH]loop cleanup in fs/namespace.c - repost
> The patch given below replaces the goto-loop by a while-based one.

That certainly looks fine.  I would also replace the 'return' with
'break', but I guess that's more of a question of personal preference.

> Besides, it removes the export for the same routine, because there are
> no users for it outside of the core VFS code.

This doesn't look fine.  Did you test this?

mntput_no_expire() is called from mntput() which is an inline function
in mount.h.  So lots of callers of mntput() in modules will end up
trying to call mntput_no_expire() from modules.

$ nm fs/fuse/fuse.ko | grep mntput_no_expire
         U mntput_no_expire

- z
Re: [RFC] fs io with struct page instead of iovecs
Badari Pulavarty wrote:
> On Tue, 2007-11-06 at 17:43 -0800, Zach Brown wrote:
>> At the FS meeting at LCE there was some talk of doing O_DIRECT writes
>> from the kernel with pages instead of with iovecs.
>
> Why ?  Whats the use case ?

Well, I think there's a few:

There are existing callers which hold a kmap() across ->write, which
isn't great.  ecryptfs does this.  That's mentioned in the patch series.

Arguably loopback should be using this instead of copying some fs paths
and trying to call aop methods directly.

I seem to remember Christoph and David having stories of knfsd folks in
SGI wanting to do O_DIRECT writes from knfsd?  (If not, *I* kind of
want to, after rolling some patches to align net rx descriptors :)).

Lustre shows us that there is a point at which you can't saturate your
network and storage if your cpu is copying all the data.  I'll be the
first to admit that the community might not feel a pressing need to
address this for in-kernel file system writers, but the observation
remains.

- z
[PATCH 3/4] add rwmem type backed by pages
This lets callers specify a region of memory to read from or write to
with an array of page/offset/len tuples.

There have been specific requests to do this from servers which want to
do O_DIRECT from the kernel.  (knfsd?)

This could also be used by places which currently hold a kmap() and
call fop->write.  ecryptfs_write_lower_page_segment() is one such
caller.

---
 fs/rwmem.c            |   66 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rwmem.h |   16 ++++++++++++++++
 2 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/fs/rwmem.c b/fs/rwmem.c
index 0433ba4..c87e8a4 100644
--- a/fs/rwmem.c
+++ b/fs/rwmem.c
@@ -90,3 +90,69 @@ struct rwmem_ops rwmem_iovec_ops = {
 	.seg_bytes	= rwmem_iovec_seg_bytes,
 	.get_seg_pages	= rwmem_iovec_get_seg_pages,
 };
+
+void rwmem_pages_init(struct rwmem *rwm)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	struct pgol *pgol;
+	unsigned long i;
+
+	rwm->total_bytes = 0;
+	rwm->nr_pages = rwm->nr_segs;
+	rwm->boundary_bits = 0;
+
+	for (i = 0; i < rwm->nr_segs; i++) {
+		pgol = &rwp->pgol[i];
+
+		rwm->total_bytes += pgol->len;
+		rwm->boundary_bits |= pgol->offset | pgol->len;
+	}
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_pages_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwp->pgol[i].offset;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_pages_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwp->pgol[i].len;
+}
+
+/*
+ * For now each page is its own seg.
+ */
+int rwmem_pages_get_seg_pages(struct rwmem *rwm, unsigned long i,
+			      unsigned long *cursor, struct page **pages,
+			      unsigned long max_pages, int write)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	int ret = 0;
+
+	BUG_ON(i >= rwm->nr_segs);
+	BUG_ON(*cursor != 0);
+
+	if (max_pages) {
+		pages[0] = rwp->pgol[i].page;
+		get_page(pages[0]);
+		ret = 1;
+	}
+	return ret;
+}
+
+struct rwmem_ops rwmem_pages_ops = {
+	.init		= rwmem_pages_init,
+	.seg_page_offset= rwmem_pages_seg_page_offset,
+	.seg_bytes	= rwmem_pages_seg_bytes,
+	.get_seg_pages	= rwmem_pages_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
index 666f9f4..47019f0 100644
--- a/include/linux/rwmem.h
+++ b/include/linux/rwmem.h
@@ -26,4 +26,20 @@ struct rwmem_iovec {
 };
 struct rwmem_ops rwmem_iovec_ops;

+/*
+ * How many times do we need this in subsystems before we make a universal
+ * struct?  (bio_vec, skb_frag_struct, pipe_buffer)
+ */
+struct pgol {
+	struct page *page;
+	unsigned int offset;
+	unsigned int len;
+};
+
+struct rwmem_pages {
+	struct rwmem	rwmem;
+	struct pgol	*pgol;
+};
+struct rwmem_ops rwmem_pages_ops;
+
 #endif
--
1.5.2.2
[PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write
This adds a structure and interface to represent the segments of memory
which are acting as the source or destination for a read or write
operation.  Callers would fill this structure and then pass it down the
rw path.  The intent is to let stages in the rw path make specific
calls against this API and structure instead of working with, say,
struct iovec natively.

The main intent of this is to enable kernel calls into the rw path
which specify memory with page/offset/len tuples.

Another potential benefit of this is the reduction in iterations over
iovecs at various points in the kernel.  Each iov_length(iov) call, for
example, could be translated into rwm->total_bytes.  O_DIRECT's check
of memory alignment is changed into a single test against
rwm->boundary_bits.

I imagine this might integrate well with the iov_iter interface, though
I haven't examined that in any depth.

---
 fs/Makefile           |    2 +-
 fs/rwmem.c            |   92 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rwmem.h |   29 ++++++++++++++++
 3 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 fs/rwmem.c
 create mode 100644 include/linux/rwmem.h

diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..c342365 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o
+		stack.o rwmem.o

 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/rwmem.c b/fs/rwmem.c
new file mode 100644
index 000..0433ba4
--- /dev/null
+++ b/fs/rwmem.c
@@ -0,0 +1,92 @@
+#include
+#include
+#include
+#include
+#include
+
+static inline unsigned long pages_spanned(unsigned long addr,
+					  unsigned long bytes)
+{
+	return ((addr + bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) -
+		(addr >> PAGE_SHIFT);
+}
+
+void rwmem_iovec_init(struct rwmem *rwm)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	struct iovec *iov;
+	unsigned long i;
+
+	rwm->total_bytes = 0;
+	rwm->nr_pages = 0;
+	rwm->boundary_bits = 0;
+
+	for (i = 0; i < rwm->nr_segs; i++) {
+		iov = &rwi->iov[i];
+
+		rwm->total_bytes += iov->iov_len;
+		rwm->nr_pages += pages_spanned((unsigned long)iov->iov_base,
+					       iov->iov_len);
+		rwm->boundary_bits |= (unsigned long)iov->iov_base |
+				      (unsigned long)iov->iov_len;
+	}
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_iovec_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return (unsigned long)rwi->iov[i].iov_base & ~PAGE_MASK;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_iovec_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwi->iov[i].iov_len;
+}
+
+int rwmem_iovec_get_seg_pages(struct rwmem *rwm, unsigned long i,
+			      unsigned long *cursor, struct page **pages,
+			      unsigned long max_pages, int write)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	struct iovec *iov;
+	int ret;
+
+	BUG_ON(i >= rwm->nr_segs);
+	iov = &rwi->iov[i];
+
+	if (*cursor == 0)
+		*cursor = (unsigned long)iov->iov_base;
+
+	max_pages = min(pages_spanned(*cursor, iov->iov_len -
+				      (*cursor - (unsigned long)iov->iov_base)),
+			max_pages);
+
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, *cursor, max_pages, write,
+			     0, pages, NULL);
+	up_read(&current->mm->mmap_sem);
+
+	if (ret > 0) {
+		*cursor += ret * PAGE_SIZE;
+		if (*cursor >= (unsigned long)iov->iov_base + iov->iov_len)
+			*cursor = ~0;
+	}
+
+	return ret;
+}
+
+struct rwmem_ops rwmem_iovec_ops = {
+	.init		= rwmem_iovec_init,
+	.seg_page_offset= rwmem_iovec_seg_page_offset,
+	.seg_bytes	= rwmem_iovec_seg_bytes,
+	.get_seg_pages	= rwmem_iovec_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
new file mode 100644
index 000..666f9f4
--- /dev/null
+++ b/include/linux/rwmem.h
@@ -0,0 +1,29 @@
+#ifndef
[PATCH 2/4] dio: use rwmem to work with r/w memory arguments
This switches dio to work with the rwmem api to get memory pages for
the IO instead of working with iovecs directly.

It can use direct rwm struct accesses for some static universal
properties of a set of memory segments that make up the buffer
argument.  It uses helper functions to work with the underlying data
structures directly.

---
 fs/direct-io.c |  123 +++++++++++++++++++++++---------------------------------
 1 files changed, 55 insertions(+), 68 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index acf0da1..0d5ed41 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -34,7 +34,7 @@
 #include
 #include
 #include
-#include
+#include
 #include

 /*
@@ -105,11 +105,12 @@ struct dio {
 	sector_t cur_page_block;	/* Where it starts */

 	/*
-	 * Page fetching state. These variables belong to dio_refill_pages().
+	 * Page fetching state.  direct_io_worker() sets these for
+	 * dio_refill_pages() who modifies them as it fetches.
 	 */
-	int curr_page;			/* changes */
-	int total_pages;		/* doesn't change */
-	unsigned long curr_user_address;/* changes */
+	struct rwmem *rwm;
+	unsigned long cur_seg;
+	unsigned long cur_seg_cursor;

 	/*
 	 * Page queue. These variables belong to dio_refill_pages() and
@@ -146,21 +147,11 @@ static inline unsigned dio_pages_present(struct dio *dio)
  */
 static int dio_refill_pages(struct dio *dio)
 {
+	struct rwmem *rwm = dio->rwm;
 	int ret;
-	int nr_pages;
-
-	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(
-		current,		/* Task for fault acounting */
-		current->mm,		/* whose pages? */
-		dio->curr_user_address,	/* Where from? */
-		nr_pages,		/* How many pages? */
-		dio->rw == READ,	/* Write to memory? */
-		0,			/* force (?) */
-		&dio->pages[0],
-		NULL);			/* vmas */
-	up_read(&current->mm->mmap_sem);
+
+	ret = rwm->ops->get_seg_pages(rwm, dio->cur_seg, &dio->cur_seg_cursor,
+				      dio->pages, DIO_PAGES, dio->rw == READ);

 	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
 		struct page *page = ZERO_PAGE(0);
@@ -180,8 +171,6 @@ static int dio_refill_pages(struct dio *dio)
 	}

 	if (ret >= 0) {
-		dio->curr_user_address += ret * PAGE_SIZE;
-		dio->curr_page += ret;
 		dio->head = 0;
 		dio->tail = ret;
 		ret = 0;
@@ -938,11 +927,9 @@ out:
  */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
-	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
-	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-	struct dio *dio)
+	struct rwmem *rwm, loff_t offset, unsigned blkbits,
+	get_block_t get_block, dio_iodone_t end_io, struct dio *dio)
 {
-	unsigned long user_addr;
 	unsigned long flags;
 	int seg;
 	ssize_t ret = 0;
@@ -966,44 +953,33 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;

+	dio->rwm = rwm;
+
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
 	 * pages since we need to zero out first and last block.
 	 */
+	dio->pages_in_io = rwm->nr_pages;
 	if (unlikely(dio->blkfactor))
-		dio->pages_in_io = 2;
-
-	for (seg = 0; seg < nr_segs; seg++) {
-		user_addr = (unsigned long)iov[seg].iov_base;
-		dio->pages_in_io +=
-			((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE
-				- user_addr/PAGE_SIZE);
-	}
+		dio->pages_in_io += 2;

-	for (seg = 0; seg < nr_segs; seg++) {
-		user_addr = (unsigned long)iov[seg].iov_base;
-		dio->size += bytes = iov[seg].iov_len;
+	for (seg = 0; seg < rwm->nr_segs; seg++) {
+		dio->size += bytes = rwm->ops->seg_bytes(rwm, seg);

 		/* Index into the first page of the first block */
-		dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
+		dio->first_block_in_page =
+			rwm->ops->seg_page_offset(rwm, seg) >> blkbits;
 		dio->final_block_in_request = dio->block_in_file +
 						(bytes >> blkbits);
 		/* Page fetching state */
+		dio->cur_seg = seg;
+		dio->cur_seg_cursor = 0;
 		dio->head = 0;
 		dio->tail = 0;
-
[PATCH 4/4] add dio interface for page/offset/len tuples
This is what it might look like to feed pgol in to some part of the fs
stack instead of iovecs.  I imagine we'd want to do it at a much higher
level, perhaps something like vfs_write_pages().

---
 fs/direct-io.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0d5ed41..e86bcbc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1213,3 +1213,24 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 }

 EXPORT_SYMBOL(__blockdev_direct_IO);
+
+ssize_t
+__blockdev_direct_IO_pages(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, struct pgol *pgol, loff_t offset,
+	unsigned long nr_pages, get_block_t get_block, dio_iodone_t end_io,
+	int dio_lock_type)
+{
+	struct rwmem_pages rwp = {
+		.rwmem.ops = &rwmem_pages_ops,
+		.rwmem.nr_segs = nr_pages,
+		.pgol = pgol,
+	};
+	struct rwmem *rwm = &rwp.rwmem;
+
+	rwm->ops->init(rwm);
+
+	return blockdev_direct_IO_rwmem(rw, iocb, inode, bdev, rwm, offset,
+					get_block, end_io, dio_lock_type);
+}
+
+EXPORT_SYMBOL(__blockdev_direct_IO_pages);
--
1.5.2.2
[RFC] fs io with struct page instead of iovecs
At the FS meeting at LCE there was some talk of doing O_DIRECT writes
from the kernel with pages instead of with iovecs.  This patch series
explores one direction we could head in to achieve this.

We obviously can't just translate user iovecs (which might represent
more memory than the machine has) to pinned pages and pass page struct
pointers all the way down the stack.  And we don't want to duplicate a
lot of the generic FS machinery down a duplicate path which works with
pages instead of iovecs.

So this patch series heads in the direction of abstracting out the
memory argument to the read and write calls.  It's still based on
segments, but we hide that it's either iovecs or arrays of page
pointers from the callers in the rw stack.

I didn't go too nuts with the syntactic sugar.  This is just intended
to show the basic mechanics.  We can obviously pretty it up if we think
this is a sane thing to be doing.  The series builds but has never been
run.

What do people (that's you, Christoph) think?  I'm flexible.
Re: [patch 0/6][RFC] Cleanup FIBMAP
> The second use case is to look at the physical layout of blocks on disk
> for a specific file, use Mark Lord's write_long patches to inject a disk
> error and then read that file to make sure that we are handling disk IO
> errors correctly.  A bit obscure, but really quite useful.

Hmm, yeah, that's interesting.

> We have also used FIBMAP a few times to try and map an observed IO error
> back to a file.  Really slow and painful to do, but should work on any
> file system when a better method is not supported.

We're getting off of this FIBMAP topic, but this interests me.  Can we
explore this a little?

How did you find out about the error without having a file to associate
with it?  Drive scrubbing, or some such?

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
> But, we shouldn't inflict all of this on fibmap/fiemap... we'll get
> lost trying to make the one true interface for all operations.
>
> For grouping operations on files, I think a read_tree syscall with
> hints for what userland will do (read, stat, delete, list
> filenames), and a better cookie than readdir should do it.

Agreed, on both points.

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
> Can you clarify what you mean above with an example?  I don't really
> follow.

Sure, take 'tar' as an example.  It'll read files in the order that
their names are returned from directory listing.  This can produce bad
IO patterns because the order in which the file names are returned
doesn't match the order of the file's blocks on disk.  (htree, I'm
looking at you!)

People have noticed that tar-like loads can be sped up greatly just by
sorting the files by their inode number as returned by stat(), never
mind the file blocks themselves.  One example of this is Chris Mason's
'acp'.

  http://oss.oracle.com/~mason/acp/

The logical extension of that is to use FIBMAP to find the order of
file blocks on disk and then doing IO on blocks in sorted order.  It'd
take work to write an app that does this reliably, sure.

In this use the application doesn't actually care what the absolute
numbers are.  It cares about their ordering.  File systems would be
able to choose whatever scheme they wanted for the actual values of the
results from a FIBMAP-alike as long as the sorting resulted in the
right IO patterns.

Arguing that this use is significant enough to justify an addition to
the file system API is a stretch.  I'm just sharing the observation.

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
>> And another of my pet peeves with ->bmap is that it uses 0 to mean
>> "sparse" which causes a conflict on NTFS at least as block zero is
>> part of the $Boot system file so it is a real, valid block...  NTFS
>> uses -1 to denote sparse blocks internally.
>
> Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if
> there was a way to indicate your-data-is-here-but-isn't-alone.  But
> that's more of a feature for the FIEMAP stuff.

And maybe we can step back and see what the callers of FIBMAP are doing
with the results they're getting.

One use is to discover the order in which to read file data that will
result in efficient IO.  If we had an interface specifically for this
use case then perhaps a sparse block would be better reported as the
position of the inode relative to other data blocks.  Maybe the inode
block number in ext* land.

This use is interesting to me because it can be useful for any file
system, particularly networked file systems.

- z
Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]
Roland Dreier wrote:
> > > +static inline void *ERR_CAST(const void *ptr)
> > > +{
> > > +	return (void *) ptr;
> > > +}
> >
> > Just to nit, surely you don't need the cast inside the function.  The
> > casting happens at the call site between the argument and returned
> > pointer.
>
> The way it's written you kinda do, since it takes a const void * and
> returns a plain void *.

Ah, right, I missed that.

- z
Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]
> + * ERR_CAST - Explicitly cast an error-valued pointer to another pointer type
> + * @ptr: The pointer to cast.
> + *
> + * Explicitly cast an error-valued pointer to another pointer type in such a
> + * way as to make it clear that's what's going on.
> + */
> +static inline void *ERR_CAST(const void *ptr)
> +{
> +	return (void *) ptr;
> +}

Just to nit, surely you don't need the cast inside the function.  The
casting happens at the call site between the argument and returned
pointer.

- z
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
> +	return ERR_PTR(PTR_ERR(inode));

I tend to prefer the latter.

It seems like a pretty noisy way to get a (void *) cast :/.  Maybe a
function that has the cast but makes sure it's only used for IS_ERR()
pointers?

/* haha, continuing the fine tradition of terrible names in this api.. */
static inline void *PTR_PTR(void *err_ptr)
{
	BUG_ON(!IS_ERR(err_ptr) || !err_ptr);
	return err_ptr;
}

Meh.

- z
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
If you're soliciting opinions, I think I tend to prefer the feel of the
code paths after the changes.  I don't know if the benefits of the
change are worth the risk in unmaintained file systems, though.

> +	return ERR_PTR(PTR_ERR(inode));

This caught my eye.  Surely we can do better :).  It seems to happen a
few times in the patches; the instance in this patch was the first that
I noticed.

- z
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
> I fear the consequences of this change :(

I love it.  In the past I've lost time by working with patches which
didn't quite realize that ext3 holds a transaction open during
->direct_IO.

> Oh well, please keep it alive, maybe beat on it a bit, resend it
> later on?

I can test the patch to make sure that it catches mistakes I've made in
the past.

Peter, do you have any interest in seeing how far we can get at
tracking lock_page()?  I'm not holding my breath, but any little bit
would probably help.

- z
Re: vm/fs meetup details
> - repair driven design, we know what it is (Val told us), but how
>   does it apply to the things we are currently working on? should we
>   do more of it?

I'm sure Chris and I could talk about the design elements in btrfs that
should aid repair if folks are interested in hearing about them.  We'd
keep the hand-waving to a minimum :).

- z
Re: [PATCH] dio: remove bogus refcounting BUG_ON
> the BUG_ON().  But unfortunately, our perf. team is able to reproduce
> the problem.

What are they doing to reproduce it?  How much setup does it take?

> Debug indicated that, the ret2 == 1 :(

That could be consistent with the theory that we're racing with the dio
struct being freed and reused before it's tested in the BUG_ON()
condition.

Suparna's suggestion to sample dio->is_async before releasing the
refcount and using that in the BUG_ON() condition is a good one.

- z
[PATCH] dio: remove bogus refcounting BUG_ON
Linus, Andrew, please apply the bug fix patch at the end of this reply
for .22.

> >> One of our perf. team ran into this while doing some runs.
> >> I didn't see anything obvious - it looks like we converted
> >> async IO to synchronous one.  I didn't spend much time digging
> >> around.

OK, I think this BUG_ON() is just broken.  I wasn't able to find any
obvious bugs from reading the code which would cause the BUG_ON() to
fire.  If it's reproducible I'd love to hear what the recipe is.

I did notice that this BUG_ON() is evaluating dio after having dropped
its ref :/.  So it's not completely absurd to fear that it's a race
with the dio's memory being reused, but that'd be a pretty tight race.

Let's remove this stupid BUG_ON and see if that test box still has
trouble.  It might just hit the valid BUG_ON a few lines down, but this
unsafe BUG_ON needs to go.

---

dio: remove bogus refcounting BUG_ON

Badari Pulavarty reported a case of this BUG_ON triggering during
testing.  It's completely bogus and should be removed.

It's trying to notice if we left references to the dio hanging around
in the sync case.  They should have been dropped as IO completed while
this path was in dio_await_completion().  This condition will also be
checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a
few lines lower.  So to start this BUG_ON() is redundant.

More fatally, it's dereferencing dio->is_async after having dropped its
reference.  It's only safe to dereference the dio after releasing the
lock if the final reference was just dropped.  Another CPU might free
the dio in bio completion and reuse the memory after this path drops
the dio lock but before the BUG_ON() is evaluated.

This patch passed aio+dio regression unit tests and aio-stress on ext3.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Cc: Badari Pulavarty <[EMAIL PROTECTED]>

diff -r 509ce354ae1b fs/direct-io.c
--- a/fs/direct-io.c	Sun Jul 01 22:00:49 2007 +
+++ b/fs/direct-io.c	Tue Jul 03 14:56:41 2007 -0700
@@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	ret2 = --dio->refcount;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
-	BUG_ON(!dio->is_async && ret2 != 0);
+
 	if (ret2 == 0) {
 		ret = dio_complete(dio, offset, ret);
 		kfree(dio);
Re: DIO panic on 2.6.21.5
On Jun 27, 2007, at 8:01 PM, Badari Pulavarty wrote:

> Hi Zach,
>
> One of our perf. team ran into this while doing some runs.
> I didn't see anything obvious - it looks like we converted
> async IO to synchronous one.  I didn't spend much time digging
> around.

It looks pretty bad, a *shouldn't happen* kind of case.  I'm sure it
just means I missed some case :/.

> Is this a known issue ?  Any ideas ?

I haven't seen it before, no.  Do they have a reliable recipe to
reproduce it?

In any case, I'll dig in early next week when I'm back from OLS.  I'm
pretty confident that we'll be able to find the case.  Please let me
know if it's urgent and I should try and find time before then.

Thanks for reporting it.

- z
Re: vm/fs meetup in september?
> > I'd just like to take the chance also to ask about a VM/FS meetup
> > some time around kernel summit (maybe take a bit of time during
> > UKUUG or so).

Yeah, I'd be interested.

> More issues:

- chris mason's patches to normalize buffered and direct locking

- z
Re: [PATCH] eCryptfs: Delay writing 0's after llseek until write
> FWIW, I believe Andrew's point was that critical information for Joe
> Enduser (and Joe Patch-Ho) was lacking in the original changelog.

and don't forget Joe eCryptfs-Maintainer-2-Years-In-The-Future.

- z
Re: [patch 1/3] fs: add an iovec iterator
> What I have there is not actually a full-blown file io descriptor,
> because there is no file or offset.  It is just an iovec iterator (so
> maybe I should rename it to iov_iter, rather than iodesc).  I think
> it might be a nice idea to keep this iov_iter as a standalone
> structure, and it could be embedded into a struct file_io?

Yeah, maybe.  I'm not sure we need something as generic as a "file_io"
struct.

To recap, I've hoped for the expression of the state of an iovec array
with a richer structure to avoid the multiple walks of the array at
different parts of the kernel that previously only had access to the
raw iovec * and size_t count.  Stuff like the alignment checks in
__blockdev_direct_IO() and the pages_in_io accounting in
direct_io_worker().

I imagined building up the state in this 'iodesc' structure as we first
copied and verified the structure from userspace (say in
rw_copy_check_uvector()).  If as we copied we, say, or'd in the bits of
the buffer and length fields, then by the time we got to
__blockdev_direct_IO() we'd just test the bits for misalignment instead
of iterating over the array again.

It starts to get gross as some paths currently modify the kernel copy
of the array as they process it :/.

- z
Re: [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
> The test case Linus sent me boils down to this:
>
> 	fd = open(file)
> 	buffer = mmap(fd, 128 pages);
> 	close(fd);
> 	fd = open(file, O_DIRECT);
> 	write(fd, buffer, 66 pages);

Yeah, though I bet the inner close/open isn't needed.

> I think the deadlock is limited to cases where get_user_pages will
> get stuck in filemap_nopage waiting for placeholders inserted by this
> DIO.

Yeah.

> It looks like that can only happen when buffer is mapped at the start
> of the dio.

At the *start* of the dio or by the time get_user_pages() is called?
The dio and mmap() aren't serialized, are they?  mmap() just sets up
the vma, I thought, and will only touch the mmap_sem.

I'm fearing threads racing write(fd, buffer, ) and mmap(buffer,
MAP_FIXED...).  I might just be missing the locking that serializes
them.  If nothing else, this should be mentioned in the comment above
the code that looks like a racy test against the presence of a mapping.

- z
Re: [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
> +static void dio_unlock_page_range(struct dio *dio)
> +{
> +	if (dio->lock_type != DIO_NO_LOCKING) {
> +		remove_placeholder_pages(dio->inode->i_mapping,
> +					 dio->fspages_start_off,
> +					 dio->fspages_end_off);
> +		dio->fspages_end_off = dio->fspages_start_off;

I wonder if it should call remove_placeholder_pages when _end_off !=
_start_off.  That feels like a more direct test.

> +	max_size = (dio->fspages_end_off - index) << PAGE_CACHE_SHIFT;
> +	if (map_bh->b_size > max_size)
> +		map_bh->b_size = max_size;

b_size = min(b_size, max_size)?

> +	/* if the mapping is mapped, they may be using a mmap'd portion
> +	 * of the file as the buffer for this io.  That will deadlock
> +	 * with placeholders because the placeholder code forces the
> +	 * page fault handler to block.  The (ugly) solution is to
> +	 * limit the span of inserted placeholders to the same
> +	 * increment we use for get_user_pages.
> +	 */
> +	if (inode->i_mapping->nrpages ||
> +	    mapping_mapped(inode->i_mapping))
> +		dio->fspages_span = DIO_PAGES;
> +	else
> +		dio->fspages_span = ULONG_MAX;

Couldn't the decision to use DIO_PAGES in the presence of a mapping
race with establishing a mapping?  I fear we have to use DIO_PAGES all
the time, which at least has the benefit of getting rid of the _span
member of dio :).

- z
Re: [PATCH 1 of 8] Introduce a place holder page for the pagecache
> Placeholders can span a range bigger than one page.  The placeholder
> is inserted into the radix slot for the end of the range, and the
> flags field in the page struct is used to record the start of the
> range.

Not surprisingly, I like how this makes the interaction between
buffered and O_DIRECT a lot less magical.  Nice work.

Some comments:

> +void wait_on_placeholder_pages_range(struct address_space *mapping,
> +				     pgoff_t start, pgoff_t end)
...
> +	while (start <= end) {
...
> +			start = pages[i]->index + 1;
> +			if (pages[i]->index > end)
> +				goto done;

It looks like this might be confused by an end and final page->index of
~0?  Maybe it'd be safer to explicitly deal with this here instead of
relying on safe arguments.

> +		}
> +		if (need_resched()) {
> +			read_unlock_irq(&mapping->tree_lock);
> +			cond_resched();
> +			read_lock_irq(&mapping->tree_lock);
> +		}

We should introduce a cond_resched_lock_irq() to match
cond_resched_lock().

> +void remove_placeholder_pages(struct address_space *mapping,
> +			      unsigned long start,
> +			      unsigned long end)
> +{
> +	struct page *page;
> +	int ret;
> +	int i;
> +	unsigned long num;
> +	struct page *pages[8];
> +
> +	write_lock_irq(&mapping->tree_lock);
> +	while (start < end) {
...
> +	write_unlock_irq(&mapping->tree_lock);

And call that cond_resched_lock_irq() in here too?  Since we're
grabbing the lock in here, presumably the caller won't mind if we grab
and release it a few times?

> +static int insert_placeholder(struct address_space *mapping,
> +			      struct page *insert)
> +{

spin_lock_assert(&mapping->tree_lock)?

> +int find_or_insert_placeholders(struct address_space *mapping,
> +				unsigned long start, unsigned long end,
> +				gfp_t gfp_mask, int iowait)
> +{
...
> +	/*
> +	 * this gets complicated.

It sure does!  Sprinkling a few comments above the main blocks in the
loop might be nice.

> +			err = filemap_write_and_wait_range(mapping,
> +					cur << PAGE_CACHE_SHIFT,
> +					end << PAGE_CACHE_SHIFT);

This end is exclusive (>= end breaks) but filemap_write_and_wait_range()
-> wait_on_page_writeback_range() wants an argument that is inclusive
(> end breaks)?

> +		if (PagePlaceHolder(page)) {
> +			DEFINE_WAIT(wait);
> +			wait_queue_head_t *wqh = page_waitqueue(page);
> +			prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +			read_unlock_irq(&mapping->tree_lock);
> +			io_schedule();
> +			finish_wait(wqh, &wait);
> +			read_lock_irq(&mapping->tree_lock);

This pattern seems to be repeated quite a few times.  Maybe a helper
would.. help?  Maybe it's not worth it if it has to do silly little
pre-schedule tasks differently at each site.

> -	for (i = 0; i < ret; i++)
> -		page_cache_get(pages[i]);
> +	while (i < ret) {
> +		if (PagePlaceHolder(pages[i])) {
> +			/* we can't return a place holder, shift it away */
> +			if (i + 1 < ret) {
> +				memcpy(&pages[i], &pages[i+1],
> +				       (ret - i - 1) * sizeof(struct page *));
> +			}
> +			ret--;
> +			continue;
> +		} else
> +			page_cache_get(pages[i]);
> +		i++;

It looks like this loop is implemented twice?  How about factoring it
out?  Especially with the risk of off-by-ones in there.  (Though I
don't see problems, it looks correct.)

More on the rest of the series after lunch :).

- z
Re: [RFC] Heads up on a series of AIO patchsets
> generic_write_checks() are done in the submission path, not repeated
> during retries, so such types of checks are not intended to run in
> the aio thread.

Ah, I see, I was missing the short-cut which avoids re-running parts of
the write path if we're in a retry.

	if (!is_sync_kiocb(iocb) && kiocbIsRestarted(iocb)) {
		/* nothing to transfer, may just need to sync data */
		return ocount;

It's pretty subtle that this has to be placed before the first
significant current reference and that nothing in the path can return
-EIOCBRETRY until after all of the significant current references.

In totally unrelated news, I noticed that current->io_wait is set to
NULL instead of &current->__wait after having run the iocb.  I wonder
if it shouldn't be saved and restored instead.  And maybe update the
comment over is_sync_wait()?  Just an observation.

> That is great and I look forward to it :)  I am, however assuming
> that whatever implementation you come up will have a different
> interface from current linux aio -- i.e. a next generation aio model,
> that will be easily integratable with kevents etc.

Yeah, that's the hope.

> Which takes me back to Ingo's point - lets have the new evolve
> parallely with the old, if we can, and not hold up the patches for
> POSIX AIO to start using kernel AIO, or for epoll to integrate with
> AIO.

Sure, though there are only so many hours in a day :).

> OK, I just took a quick look at your blog and I see that you are
> basically implementing Linus' microthreads scheduling approach - one
> year since we had that discussion.

Yeah.  I wanted to see what it would look like.

> Glad to see that you found a way to make it workable ...

Well, that remains to be seen.  If nothing else we'll at least have
code to point at when discussing it.  If we all agree it's not the
right way and dismiss the notion, fine, that's progress :).

> (I'm guessing that you are copying over the part of the stack in use
> at the time of every switch, is that correct ?

That was my first pass, yeah.  It turned the knob a little too far
towards the "invasive but efficient" direction for my taste.  I'm now
giving it a try by having full stacks for each blocked op, we'll see
how that goes.

> At what point do you do the allocation of the saved stacks ?

I was allocating at block-time to keep memory consumption down, but I
think my fiddling around with it convinced me that isn't workable.

- z
Re: [RFC] Heads up on a series of AIO patchsets
Sorry for the delay, I'm finally back from the holiday break :)

> (1) The filesystem AIO patchset, attempts to address one part of the
> problem which is to make regular file IO, (without O_DIRECT)
> asynchronous (mainly the case of reads of uncached or partially
> cached files, and O_SYNC writes).

One of the properties of the currently implemented EIOCBRETRY aio path
is that ->mm is the only field in current which matches the submitting
task_struct while inside the retry path.

It looks like a retry-based aio write path would be broken because of
this.  generic_write_checks() could run in the aio thread and get its
task_struct instead of that of the submitter.  The wrong rlimit will
be tested and SIGXFSZ won't be raised.  remove_suid() could check the
capabilities of the aio thread instead of those of the submitter.

I don't think EIOCBRETRY is the way to go because of this increased
(and subtle!) complexity.  What are the chances that we would have
ever found those bugs outside code review?  How do we make sure that
current references don't sneak back in after having initially audited
the paths?

Take the io_cmd_epoll_wait patch..

> issues).  The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown
> with modifications from Jeff Moyer and me) addresses this problem for
> native linux aio in a simple manner.

It's simple looking, sure.  This current flipping didn't even occur to
me while throwing the patch together!

But that patch ends up calling ->poll (and poll_table->qproc) and
writing to userspace (so potentially calling ->nopage) from the aio
threads.  Are we sure that none of them will behave surprisingly
because current changed under them?

It might be safe now, but that isn't really the point.  I'd rather we
didn't have yet one more subtle invariant to audit and maintain.

At the risk of making myself vulnerable to the charge of mentioning
vapourware, I will admit that I've been working on a (slightly mad)
implementation of async syscalls.  I've been quiet about it because I
don't want to whip up complicated discussion without being able to
show code that works, even if barely.  I mention it now only to make
it clear that I want to be constructive, not just critical :).

- z
Re: [EMAIL PROTECTED]: Re: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs]
> Currently we don't support buffered aio on any filesystem in
> mainline, so adding crufty code to mainline sounds like a bad idea.
> Zab agreed on that and wants to remove it as much as it gets.

Yeah, we aim to simplify this code.  For the record, it wasn't buffered
aio that was the problem.  There were two naughty moving parts:

First, trying not to block in the dlm when issuing aio ops and tracking
state to restart after dlm ops returned -EIOCBQUEUED.  This was just
overly aggressive.  This can behave like block mapping lookups in that
it rarely blocks.  Most aio that people care about (direct io writes to
already allocated regions) will simply be acquiring and releasing
shared-read locks around each op -- trivial local operations.

Second, trying to hold dlm locks around the entirety of aio ops.  This
led to the mess of trying to tear down locks in the iocb dtor method.
(Which can then race with unmount; aio does __fput on the filp,
dropping the vfsmount ref, before calling the dtor.. bleh.)  We can get
around this by unlocking after performing the block mapping lookups and
issuing the io, and introducing a cluster DLM lock which behaves like
i_alloc_sem.

So, how about a patch that lets the fs provide a callback to
acquire/release i_alloc_sem at the current sites (dio, notify_change)
that work with it?  Most file systems wouldn't provide a callback and
the code would just use the sem as usual, but clustered guys could use
dlm locking.

- z
Re: [PATCH] configfs, a filesystem for userspace-driven kernel object configuration
Matt Mackall wrote:
> On Sun, Apr 03, 2005 at 12:57:28PM -0700, Joel Becker wrote:
>
>> 	An interface in /proc where the API is:
>>
>> or an ioctl(2) interface where the API is:
>>
>> becomes this in configfs:
>>
>> 	# cd /config/mythingy
>> 	# mkdir foo
>> 	# echo 1 > foo/index
>> 	# echo 3 > foo/count
>> 	# echo 0x00013 > foo/address
>>
>> 	Instead of a binary blob that's passed around or a cryptic
>> string that has to be formatted just so, configfs provides an
>> interface that's completely scriptable and navigable.
>
> How does the kernel know when to actually create the object?

"actually create", huh? :)

In the trivial case Joel describes, the item is almost certainly
allocated during "# mkdir foo" when the subsystem will get a
->make_item() call for the 'mythingy' group it registered.  The various
attribute writes then find the item by following their
configfs_attribute argument to the item that it's embedded in.

But I bet you're not really asking about creation.  I bet you're
wondering how the kernel will know when enough attributes have been
filled and that it's safe to use the object.

Misguided items could assign magical ordering to the attribute filling
such that once a final attribute is set, and others have been set, the
item goes live.  That's what ocfs2 does now, sadly, but certainly not
as a long-term solution.

The missing piece is the 'commit_item' group operation that is yet to
be implemented.  The intent is to have a directory of pending items
that can have their attributes filled before being rename()ed into a
directory of items that are in active use.  The commit_item() call that
hits at rename() would give the kernel the chance to refuse the item
because attributes haven't been filled in or conflict with existing
items, or whatever.

- z
Re: [PATCH] configfs, a filesystem for userspace-driven kernel object configuration
Arjan van de Ven wrote:
> On Sun, 2005-04-03 at 12:57 -0700, Joel Becker wrote:
>
>> Folks,
>> 	I humbly submit configfs.  With configfs, a configfs
>> config_item is created via an explicit userspace operation: mkdir(2).
>> It is destroyed via rmdir(2).  The attributes appear at mkdir(2)
>> time, and can be read or modified via read(2) and write(2).
>> readdir(3) queries the list of items and/or attributes.
>> 	The lifetime of the filesystem representation is completely
>> driven by userspace.  The lifetime of the objects themselves are
>> managed by a kref, but at rmdir(2) time they disappear from the
>> filesystem.
>
> does that mean you rmdir a non-empty directory ??

Yeah, but only attributes and default groups are automatically torn
down.  You can't rmdir() an item that is the destination of links and
you can't rmdir() groups that still contain items.

- z
Re: Address space operations questions
Bryan Henderson wrote:
>> So, semantics of ->sync_page() are roughly "kick underlying storage
>> driver to actually perform all IO queued for this page, and, maybe,
>> for other pages on this device too".
>
> I prefer to think of it in a more modular sense.  To preserve
> modularity, the caller of sync_page() can't know anything about I/O
> scheduling.  So I think the semantics of ->sync_page() are "Someone
> is about to wait for the results of the previously requested
> write_page on this page."  It's completely up to the owner of the
> address space to figure out what would be appropriate to do given
> that information.

Though I agree with your desire for a "modular" interpretation, I'm
going to disagree with your description of the information sync_page()
provides.  Nikita's vague description is closer to the truth because
sync_page() is vague.

If you follow the callers of sync_page() you quickly find that what it
*really* means to be called in sync_page() is that you're being told
that some process is about to block on that page.  For what reason,
you can't know from the call alone.  Waiting for read to complete and
unlock?  Waiting for writeback to clear the writeback bit?  Some
processes just happened to race to lock_page() on the same page for
reasons that have nothing to do with IO?

And the not-so-initiated might think that sync_page() is called with
the page lock held just like the other ops.  That's obviously not the
case, as it is called when the page lock can't be acquired, but it
makes actually *doing* something with that page argument perilous.
Synchronization is left to the sync_page() callee.

For example, cachefs seems to be among the very few who don't provide
block_sync_page() directly.  All it does is printk() and then call
block_sync_page().  In that single line of code it derefs
page->mapping without checking and so can probably fatally race with
truncation and the nulling of page->mapping.

- z
Re: Efficient handling of sparse files
> Please keep one thing in mind and that is that there are file systems
> where ->bmap actually makes no sense whatsoever

Of course, so return -ESORRY.

> This is one of the reasons why noone should be using ->bmap.  It is a
> stupid interface that only fits very particular sets of file systems
> and cannot be applied generically.

No, it's a reason to only ask about the details of block mapping in
cases where it actually makes sense (like wanting to find out if
concurrent file extension is getting good batched contiguous
allocation, etc).  Just because file systems x, y, and z can't answer
the question meaningfully doesn't mean it isn't a reasonable thing to
ask of file systems m, n, and o.

Now, I'm not at all opposed to an explicit sparse-testing interface
that doesn't confuse that functionality with querying specific block
mappings.

- z
Re: Efficient handling of sparse files
> I was wondering if we could introduce a new system call (or ioctl?)
> that, given an fd would find the next block with data in it.  We
> could use the ->bmap method ... except that has dire warnings about
> adding new callers and viro may soon be in testicle-gouging range.

Hmm.  What you're talking about reminds me of some ioctl()s Alex has
for ext3+extents and it feels like the pagevec apis that want to find
populated pages across the page cache index space.

Sooo.

	struct fs_extent {
		u64 file_start;
		u64 block_start;
		u64 contig;
	};

(I don't really care if those are in bytes or blocks or whatever.
Someone with strong opinions can pick a unit :))

	long sys_find_extents_please(int fd, off_t file_start,
				     struct fs_extent *extents,
				     long nr_extents);

so it'll fill in as many extent structs in the caller as it finds
contiguous regions on disk starting with the given file position,
returning the number populated.  I'd, somewhat obviously, want to push
this into an fs method, perhaps with a generic_ helper that just spins
on bmap().

I think this would let Alex kill his ioctl() and ocfs2 could certainly
fill this with reasonable results.

To move into lala land, I wonder if we would want to consider the
difference between mapped blocks with data and mapped blocks that
haven't been touched and which are going to return zeros.  One could
argue that it's marginally ridiculous that there isn't a shared
interface to reserve blocks without having to manually zero them.  If
we did have such an interface, something like rsync doesn't actually
care that they're mapped if the fs knows that they're still just
zeros.  I acknowledge that this is a bit out there :)

In any case, if you want help rolling proofs-of-concept I could lend a
few hours.

- z