Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
Erez Zadok wrote:
> Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree.
> Kernel w/ SMP, preemption, and lockdep configured.

This is a real lock ordering problem.  Thanks for reporting it.

The updating of atime inside sys_mmap() orders the mmap_sem in the vfs
outside of the journal handle in ext3's inode dirtying:

> -> #1 (jbd_handle){--..}:
>        [] __lock_acquire+0x9cc/0xb95
>        [] lock_acquire+0x5f/0x78
>        [] journal_start+0xee/0xf8
>        [] ext3_journal_start_sb+0x48/0x4a
>        [] ext3_dirty_inode+0x27/0x6c
>        [] __mark_inode_dirty+0x29/0x144
>        [] touch_atime+0xb7/0xbc
>        [] generic_file_mmap+0x2d/0x42
>        [] mmap_region+0x1e6/0x3b4
>        [] do_mmap_pgoff+0x1fb/0x253
>        [] sys_mmap2+0x9b/0xb5
>        [] syscall_call+0x7/0xb
>        [] 0x

ext3_direct_IO() orders the journal handle outside of the mmap_sem that
dio_get_page() acquires to pin pages with get_user_pages():

> -> #0 (&mm->mmap_sem){}:
>        [] __lock_acquire+0x8bc/0xb95
>        [] lock_acquire+0x5f/0x78
>        [] down_read+0x3a/0x4c
>        [] dio_get_page+0x4e/0x15d
>        [] __blockdev_direct_IO+0x431/0xa81
>        [] ext3_direct_IO+0x10c/0x1a1
>        [] generic_file_direct_IO+0x124/0x139
>        [] generic_file_direct_write+0x56/0x11c
>        [] __generic_file_aio_write_nolock+0x33d/0x489
>        [] generic_file_aio_write+0x58/0xb6
>        [] ext3_file_write+0x27/0x99
>        [] do_sync_write+0xc5/0x102
>        [] vfs_write+0x90/0x119
>        [] sys_write+0x3d/0x61
>        [] sysenter_past_esp+0x5f/0xa5
>        [] 0x

Two fixes come to mind:

1) use something like Peter's ->mmap_prepare() to update atime before
acquiring the mmap_sem ( http://lkml.org/lkml/2007/11/11/97 ).  I don't
know if this would leave more paths which do a journal_start() while
holding the mmap_sem.

2) rework ext3's dio to only hold the jbd handle in ext3_get_block().
Chris has a patch for this kicking around somewhere but I'm told it has
problems exposing old blocks in ordered data mode.

Does anyone have preferences?  I could go either way.  I certainly
don't like the idea of journal handles being held across the entirety
of fs/direct-io.c.
It's yet another case of O_DIRECT differing wildly from the buffered
path :(.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
> If anyone has a testcase - I can take a look at the problem again.

I can try and throw something together..

- z
Re: [PATCH] fix invalidate_inode_pages2_range not to clear ret
Hisashi Hifumi wrote:
> Hi.
>
> DIO invalidates page cache through invalidate_inode_pages2_range().
> invalidate_inode_pages2_range() sets ret=-EIO when
> invalidate_complete_page2() fails, but this ret is cleared if
> do_launder_page() succeed on a page of next index.
> In this case, dio is carried out even if invalidate_complete_page2()
> fails on some pages.
> This can cause inconsistency between memory and blocks on HDD because
> the page cache still exists.
>
> Following patch fixes this issue.

I like the idea of fixing this in a separate patch, yes, thanks!  But...

> Thanks.
>
> Signed-off-by: Hisashi Hifumi <[EMAIL PROTECTED]>
>
> diff -Nrup linux-2.6.24-rc5.org/mm/truncate.c linux-2.6.24-rc5/mm/truncate.c
> --- linux-2.6.24-rc5.org/mm/truncate.c	2007-12-12 16:32:45.0 +0900
> +++ linux-2.6.24-rc5/mm/truncate.c	2007-12-13 11:45:29.0 +0900
> @@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct
>  	pgoff_t next;
>  	int i;
>  	int ret = 0;
> +	int ret2 = 0;
>  	int did_range_unmap = 0;
>  	int wrapped = 0;
>
> @@ -441,13 +442,13 @@ int invalidate_inode_pages2_range(struct
>  			BUG_ON(page_mapped(page));
>  			ret = do_launder_page(mapping, page);
>  			if (ret == 0 && !invalidate_complete_page2(mapping, page))
> -				ret = -EIO;
> +				ret2 = -EIO;
>  			unlock_page(page);
>  		}
>  		pagevec_release(&pvec);
>  		cond_resched();
>  	}
> -	return ret;
> +	return !ret ? ret2 : ret;
>  }
>  EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);

... this doesn't work.  Notice that it only propagates the -EIO into
ret2?  It can lose errors from do_launder_page() itself because they
aren't stored into ret2, which is what's returned if the last
do_launder_page() succeeds.

This isn't what I meant when I mentioned that 'ret2' pattern.  The idea
is to store the errors from inner calls into some private var and then
only promote those errors to the function's return code (ret) if there
was an error and ret wasn't already set.  We do this in a few places in
dio, you'll notice.
This patch gets the meaning of 'ret' and 'ret2' opposite from those
uses, which is kind of confusing.

So, uh, how about the following.  Totally untested, but compiled.

---

invalidate_inode_pages2_range(): consistently return first error

Hisashi Hifumi noticed that we were losing errors in
invalidate_inode_pages2_range().  Later do_launder_page() calls could
overwrite errors generated by earlier calls.

Fix this by storing do_launder_page()'s return code in a temporary
variable which is only promoted to the function's return code if the
function hadn't already generated an error.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Cc: Hisashi Hifumi <[EMAIL PROTECTED]>

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..24578b3 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	pgoff_t next;
 	int i;
 	int ret = 0;
+	int rc;
 	int did_range_unmap = 0;
 	int wrapped = 0;

@@ -439,9 +440,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				}
 			}
 			BUG_ON(page_mapped(page));
-			ret = do_launder_page(mapping, page);
-			if (ret == 0 && !invalidate_complete_page2(mapping, page))
-				ret = -EIO;
+			rc = do_launder_page(mapping, page);
+			if (rc == 0 && !invalidate_complete_page2(mapping, page))
+				rc = -EIO;
+			if (rc && ret == 0)
+				ret = rc;
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
Hisashi Hifumi wrote:
> Hi.
>
> Current dio has some problems:
> 1, In ext3 ordered, dio write can return with EIO because of the race
> between invalidation of a page and jbd.  jbd pins the bhs while
> committing journal so try_to_release_page fails when jbd is committing
> the transaction.

Yeah.  It sure would be fantastic if some ext3 expert could stop this
from happening somehow.  But that hasn't happened in.. uh.. Badari, for
how many years has this been on the radar? :)

> Past discussion about this issue is as follows.
> http://marc.info/?t=11934343124&r=1&w=2
> http://marc.info/?t=11265676282&r=1&w=2
>
> 2, invalidate_inode_pages2_range() sets ret=-EIO when
> invalidate_complete_page2() fails, but this ret is cleared if
> do_launder_page() succeed on a page of next index.

Oops.  That's too bad.  So maybe we should fix it by not stomping on
that return code?

	ret2 = do_launder()
	if (ret2 == 0)
		ret2 = invalidate()
	if (ret == 0)
		ret = ret2

I'd be surprised if we ever wanted to mask an -EIO when later pages
laundered successfully.

> In this case, dio is carried out even if invalidate_complete_page2()
> fails on some pages.
> This can cause inconsistency between memory and blocks on HDD because
> the page cache still exists.

Yeah.

> I solved problems above by introducing invalidate_inode_pages3_range()
> and falling through to buffered I/O when invalidation of a page failed.

Well, I like the idea of more intelligently dealing with the known
problem between dio and ext3.

I'm not sure that falling back to buffered is right.  If dio can tell
that it only failed because jbd held the buffer, should we have waited
for the transaction to complete before trying to invalidate again?  If
we could do that, we'd avoid performing the IO twice.

> We can distinguish between failure of page invalidation and other errors
> with the return value of invalidate_inode_pages3_range().

I'm not sure duplicating the invalidation loop into a new function is
the right thing.
Maybe we'd just tweak inode_pages2 to indicate to the caller the
specific failing circumstances somehow.  Maybe.

- z
Re: [PATCH]loop cleanup in fs/namespace.c - repost
>> This doesn't look fine.  Did you test this?
>
> Oops, my fault.  Of course, I tested the patch, but kernel modules are
> disabled in my test setup, so I missed the error. :)
> Enclosed to this message is a new patch, which replaces the goto-loop by
> the while-based one, but leaves the EXPORT_SYMBOL macro intact.

It certainly looks OK to me now, for whatever that's worth.

You probably want to wait till the next merge window to get it in,
though.  It's just a cleanup and so shouldn't go in this late in the -rc
line.  Maybe Andrew will be willing to queue it until that time in -mm.

- z
Re: [PATCH]loop cleanup in fs/namespace.c - repost
> The patch given below replaces the goto-loop by a while-based one.

That certainly looks fine.  I would also replace the 'return' with
'break', but I guess that's more of a question of personal preference.

> Besides, it removes the export for the same routine, because there are
> no users for it outside of the core VFS code.

This doesn't look fine.  Did you test this?

mntput_no_expire() is called from mntput() which is an inline function
in mount.h.  So lots of callers of mntput() in modules will end up
trying to call mntput_no_expire() from modules.

$ nm fs/fuse/fuse.ko | grep mntput_no_expire
         U mntput_no_expire

- z
Re: [RFC] fs io with struct page instead of iovecs
Badari Pulavarty wrote:
> On Tue, 2007-11-06 at 17:43 -0800, Zach Brown wrote:
>> At the FS meeting at LCE there was some talk of doing O_DIRECT writes
>> from the kernel with pages instead of with iovecs.
>
> Why ?  Whats the use case ?

Well, I think there's a few:

There are existing callers which hold a kmap() across ->write, which
isn't great.  ecryptfs does this.  That's mentioned in the patch series.

Arguably loopback should be using this instead of copying some fs paths
and trying to call aop methods directly.

I seem to remember Christoph and David having stories of knfsd folks in
SGI wanting to do O_DIRECT writes from knfsd?  (If not, *I* kind of
want to, after rolling some patches to align net rx descriptors :)).

Lustre shows us that there is a point at which you can't saturate your
network and storage if your cpu is copying all the data.  I'll be the
first to admit that the community might not feel a pressing need to
address this for in-kernel file system writers, but the observation
remains.

- z
[PATCH 3/4] add rwmem type backed by pages
This lets callers specify a region of memory to read from or write to
with an array of page/offset/len tuples.

There have been specific requests to do this from servers which want to
do O_DIRECT from the kernel.  (knfsd?)

This could also be used by places which currently hold a kmap() and
call fop->write.  ecryptfs_write_lower_page_segment() is one such
caller.

---
 fs/rwmem.c            |   66 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rwmem.h |   16 ++++++++++++++++
 2 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/fs/rwmem.c b/fs/rwmem.c
index 0433ba4..c87e8a4 100644
--- a/fs/rwmem.c
+++ b/fs/rwmem.c
@@ -90,3 +90,69 @@ struct rwmem_ops rwmem_iovec_ops = {
 	.seg_bytes	= rwmem_iovec_seg_bytes,
 	.get_seg_pages	= rwmem_iovec_get_seg_pages,
 };
+
+void rwmem_pages_init(struct rwmem *rwm)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	struct pgol *pgol;
+	unsigned long i;
+
+	rwm->total_bytes = 0;
+	rwm->nr_pages = rwm->nr_segs;
+	rwm->boundary_bits = 0;
+
+	for (i = 0; i < rwm->nr_segs; i++) {
+		pgol = &rwp->pgol[i];
+
+		rwm->total_bytes += pgol->len;
+		rwm->boundary_bits |= pgol->offset | pgol->len;
+	}
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_pages_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwp->pgol[i].offset;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_pages_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwp->pgol[i].len;
+}
+
+/*
+ * For now each page is its own seg.
+ */
+int rwmem_pages_get_seg_pages(struct rwmem *rwm, unsigned long i,
+			      unsigned long *cursor, struct page **pages,
+			      unsigned long max_pages, int write)
+{
+	struct rwmem_pages *rwp = container_of(rwm, struct rwmem_pages, rwmem);
+	int ret = 0;
+
+	BUG_ON(i >= rwm->nr_segs);
+	BUG_ON(*cursor != 0);
+
+	if (max_pages) {
+		pages[0] = rwp->pgol[i].page;
+		get_page(pages[0]);
+		ret = 1;
+	}
+	return ret;
+}
+
+struct rwmem_ops rwmem_pages_ops = {
+	.init		= rwmem_pages_init,
+	.seg_page_offset= rwmem_pages_seg_page_offset,
+	.seg_bytes	= rwmem_pages_seg_bytes,
+	.get_seg_pages	= rwmem_pages_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
index 666f9f4..47019f0 100644
--- a/include/linux/rwmem.h
+++ b/include/linux/rwmem.h
@@ -26,4 +26,20 @@ struct rwmem_iovec {
 };
 struct rwmem_ops rwmem_iovec_ops;

+/*
+ * How many times do we need this in subsystems before we make a universal
+ * struct?  (bio_vec, skb_frag_struct, pipe_buffer)
+ */
+struct pgol {
+	struct page *page;
+	unsigned int offset;
+	unsigned int len;
+};
+
+struct rwmem_pages {
+	struct rwmem	rwmem;
+	struct pgol	*pgol;
+};
+struct rwmem_ops rwmem_pages_ops;
+
 #endif
--
1.5.2.2
[PATCH 1/4] struct rwmem: an abstraction of the memory argument to read/write
This adds a structure and interface to represent the segments of memory
which are acting as the source or destination for a read or write
operation.  Callers would fill this structure and then pass it down the
rw path.  The intent is to let stages in the rw path make specific
calls against this API and structure instead of working with, say,
struct iovec natively.

The main intent of this is to enable kernel calls into the rw path
which specify memory with page/offset/len tuples.

Another potential benefit of this is the reduction in iterations over
iovecs at various points in the kernel.  Each iov_length(iov) call, for
example, could be translated into rwm->total_bytes.  O_DIRECT's check
of memory alignment is changed into a single test against
rwm->boundary_bits.

I imagine this might integrate well with the iov_iter interface, though
I haven't examined that in any depth.

---
 fs/Makefile           |    2 +-
 fs/rwmem.c            |   92 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rwmem.h |   29 ++++++++++++++++
 3 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 fs/rwmem.c
 create mode 100644 include/linux/rwmem.h

diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..c342365 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o
+		stack.o rwmem.o

 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/rwmem.c b/fs/rwmem.c
new file mode 100644
index 000..0433ba4
--- /dev/null
+++ b/fs/rwmem.c
@@ -0,0 +1,92 @@
+#include
+#include
+#include
+#include
+#include
+
+static inline unsigned long pages_spanned(unsigned long addr,
+					  unsigned long bytes)
+{
+	return ((addr + bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) -
+		(addr >> PAGE_SHIFT);
+}
+
+void rwmem_iovec_init(struct rwmem *rwm)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	struct iovec *iov;
+	unsigned long i;
+
+	rwm->total_bytes = 0;
+	rwm->nr_pages = 0;
+	rwm->boundary_bits = 0;
+
+	for (i = 0; i < rwm->nr_segs; i++) {
+		iov = &rwi->iov[i];
+
+		rwm->total_bytes += iov->iov_len;
+		rwm->nr_pages += pages_spanned((unsigned long)iov->iov_base,
+					       iov->iov_len);
+		rwm->boundary_bits |= (unsigned long)iov->iov_base |
+				      (unsigned long)iov->iov_len;
+	}
+}
+
+/*
+ * Returns the offset of the start of a segment within its first page.
+ */
+unsigned long rwmem_iovec_seg_page_offset(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return (unsigned long)rwi->iov[i].iov_base & ~PAGE_MASK;
+}
+
+/*
+ * Returns the total bytes in the given segment.
+ */
+unsigned long rwmem_iovec_seg_bytes(struct rwmem *rwm, unsigned long i)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	BUG_ON(i >= rwm->nr_segs);
+	return rwi->iov[i].iov_len;
+}
+
+int rwmem_iovec_get_seg_pages(struct rwmem *rwm, unsigned long i,
+			      unsigned long *cursor, struct page **pages,
+			      unsigned long max_pages, int write)
+{
+	struct rwmem_iovec *rwi = container_of(rwm, struct rwmem_iovec, rwmem);
+	struct iovec *iov;
+	int ret;
+
+	BUG_ON(i >= rwm->nr_segs);
+	iov = &rwi->iov[i];
+
+	if (*cursor == 0)
+		*cursor = (unsigned long)iov->iov_base;
+
+	max_pages = min(pages_spanned(*cursor, iov->iov_len -
+				      (*cursor - (unsigned long)iov->iov_base)),
+			max_pages);
+
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, *cursor, max_pages, write,
+			     0, pages, NULL);
+	up_read(&current->mm->mmap_sem);
+
+	if (ret > 0) {
+		*cursor += ret * PAGE_SIZE;
+		if (*cursor >= (unsigned long)iov->iov_base + iov->iov_len)
+			*cursor = ~0;
+	}
+
+	return ret;
+}
+
+struct rwmem_ops rwmem_iovec_ops = {
+	.init		= rwmem_iovec_init,
+	.seg_page_offset= rwmem_iovec_seg_page_offset,
+	.seg_bytes	= rwmem_iovec_seg_bytes,
+	.get_seg_pages	= rwmem_iovec_get_seg_pages,
+};
diff --git a/include/linux/rwmem.h b/include/linux/rwmem.h
new file mode 100644
index 000..666f9f4
--- /dev/null
+++ b/include/linux/rwmem.h
@@ -0,0 +1,29 @@
+#ifndef
[PATCH 2/4] dio: use rwmem to work with r/w memory arguments
This switches dio to work with the rwmem api to get memory pages for
the IO instead of working with iovecs directly.

It can use direct rwm struct accesses for some static universal
properties of a set of memory segments that make up the buffer
argument.  It uses helper functions to work with the underlying data
structures directly.

---
 fs/direct-io.c |  123 +++++++++++++++++++++++---------------------------------
 1 files changed, 55 insertions(+), 68 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index acf0da1..0d5ed41 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -34,7 +34,7 @@
 #include
 #include
 #include
-#include
+#include
 #include

 /*
@@ -105,11 +105,12 @@ struct dio {
 	sector_t cur_page_block;	/* Where it starts */

 	/*
-	 * Page fetching state. These variables belong to dio_refill_pages().
+	 * Page fetching state.  direct_io_worker() sets these for
+	 * dio_refill_pages() who modifies them as it fetches.
 	 */
-	int curr_page;			/* changes */
-	int total_pages;		/* doesn't change */
-	unsigned long curr_user_address;/* changes */
+	struct rwmem *rwm;
+	unsigned long cur_seg;
+	unsigned long cur_seg_cursor;

 	/*
 	 * Page queue. These variables belong to dio_refill_pages() and
@@ -146,21 +147,11 @@ static inline unsigned dio_pages_present(struct dio *dio)
  */
 static int dio_refill_pages(struct dio *dio)
 {
+	struct rwmem *rwm = dio->rwm;
 	int ret;
-	int nr_pages;
-
-	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(
-		current,		/* Task for fault acounting */
-		current->mm,		/* whose pages? */
-		dio->curr_user_address,	/* Where from? */
-		nr_pages,		/* How many pages? */
-		dio->rw == READ,	/* Write to memory? */
-		0,			/* force (?) */
-		&dio->pages[0],
-		NULL);			/* vmas */
-	up_read(&current->mm->mmap_sem);
+
+	ret = rwm->ops->get_seg_pages(rwm, dio->cur_seg, &dio->cur_seg_cursor,
+				      dio->pages, DIO_PAGES, dio->rw == READ);

 	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
 		struct page *page = ZERO_PAGE(0);
@@ -180,8 +171,6 @@ static int dio_refill_pages(struct dio *dio)
 	}

 	if (ret >= 0) {
-		dio->curr_user_address += ret * PAGE_SIZE;
-		dio->curr_page += ret;
 		dio->head = 0;
 		dio->tail = ret;
 		ret = 0;
@@ -938,11 +927,9 @@ out:
  */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
-	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
-	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-	struct dio *dio)
+	struct rwmem *rwm, loff_t offset, unsigned blkbits,
+	get_block_t get_block, dio_iodone_t end_io, struct dio *dio)
 {
-	unsigned long user_addr;
 	unsigned long flags;
 	int seg;
 	ssize_t ret = 0;
@@ -966,44 +953,33 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;

+	dio->rwm = rwm;
+
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
 	 * pages since we need to zero out first and last block.
 	 */
+	dio->pages_in_io = rwm->nr_pages;
 	if (unlikely(dio->blkfactor))
-		dio->pages_in_io = 2;
-
-	for (seg = 0; seg < nr_segs; seg++) {
-		user_addr = (unsigned long)iov[seg].iov_base;
-		dio->pages_in_io +=
-			((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE
-				- user_addr/PAGE_SIZE);
-	}
+		dio->pages_in_io += 2;

-	for (seg = 0; seg < nr_segs; seg++) {
-		user_addr = (unsigned long)iov[seg].iov_base;
-		dio->size += bytes = iov[seg].iov_len;
+	for (seg = 0; seg < rwm->nr_segs; seg++) {
+		dio->size += bytes = rwm->ops->seg_bytes(rwm, seg);

 		/* Index into the first page of the first block */
-		dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
+		dio->first_block_in_page =
+			rwm->ops->seg_page_offset(rwm, seg) >> blkbits;
 		dio->final_block_in_request = dio->block_in_file +
 						(bytes >> blkbits);
 		/* Page fetching state */
+		dio->cur_seg = seg;
+		dio->cur_seg_cursor = 0;
 		dio->head = 0;
 		dio->tail = 0;
-
[PATCH 4/4] add dio interface for page/offset/len tuples
This is what it might look like to feed pgol in to some part of the fs
stack instead of iovecs.  I imagine we'd want to do it at a much higher
level, perhaps something like vfs_write_pages().

---
 fs/direct-io.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0d5ed41..e86bcbc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1213,3 +1213,24 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 }

 EXPORT_SYMBOL(__blockdev_direct_IO);
+
+ssize_t
+__blockdev_direct_IO_pages(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, struct pgol *pgol, loff_t offset,
+	unsigned long nr_pages, get_block_t get_block, dio_iodone_t end_io,
+	int dio_lock_type)
+{
+	struct rwmem_pages rwp = {
+		.rwmem.ops = &rwmem_pages_ops,
+		.rwmem.nr_segs = nr_pages,
+		.pgol = pgol,
+	};
+	struct rwmem *rwm = &rwp.rwmem;
+
+	rwm->ops->init(rwm);
+
+	return blockdev_direct_IO_rwmem(rw, iocb, inode, bdev, rwm, offset,
+					get_block, end_io, dio_lock_type);
+}
+
+EXPORT_SYMBOL(__blockdev_direct_IO_pages);
--
1.5.2.2
[RFC] fs io with struct page instead of iovecs
At the FS meeting at LCE there was some talk of doing O_DIRECT writes
from the kernel with pages instead of with iovecs.  This patch series
explores one direction we could head in to achieve this.

We obviously can't just translate user iovecs (which might represent
more memory than the machine has) to pinned pages and pass page struct
pointers all the way down the stack.  And we don't want to duplicate a
lot of the generic FS machinery down a duplicate path which works with
pages instead of iovecs.

So this patch series heads in the direction of abstracting out the
memory argument to the read and write calls.  It's still based on
segments, but we hide that it's either iovecs or arrays of page
pointers from the callers in the rw stack.

I didn't go too nuts with the syntactic sugar.  This is just intended
to show the basic mechanics.  We can obviously pretty it up if we think
this is a sane thing to be doing.  The series builds but has never been
run.

What do people (that's you, Christoph) think?  I'm flexible.
Re: [patch 0/6][RFC] Cleanup FIBMAP
> The second use case is to look at the physical layout of blocks on disk
> for a specific file, use Mark Lord's write_long patches to inject a disk
> error and then read that file to make sure that we are handling disk IO
> errors correctly.  A bit obscure, but really quite useful.

Hmm, yeah, that's interesting.

> We have also used FIBMAP a few times to try and map an observed IO error
> back to a file.  Really slow and painful to do, but should work on any
> file system when a better method is not supported.

We're getting off of this FIBMAP topic, but this interests me.  Can we
explore this a little?

How did you find out about the error without having a file to associate
with it?  Drive scrubbing, or some such?

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
> But, we shouldn't inflict all of this on fibmap/fiemap... we'll get
> lost trying to make the one true interface for all operations.
>
> For grouping operations on files, I think a read_tree syscall with
> hints for what userland will do (read, stat, delete, list
> filenames), and a better cookie than readdir should do it.

Agreed, on both points.

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
> Can you clarify what you mean above with an example?  I don't really
> follow.

Sure, take 'tar' as an example.  It'll read files in the order that
their names are returned from directory listing.  This can produce bad
IO patterns because the order in which the file names are returned
doesn't match the order of the file's blocks on disk.  (htree, I'm
looking at you!)

People have noticed that tar-like loads can be sped up greatly just by
sorting the files by their inode number as returned by stat(), never
mind the file blocks themselves.  One example of this is Chris Mason's
'acp'.

  http://oss.oracle.com/~mason/acp/

The logical extension of that is to use FIBMAP to find the order of
file blocks on disk and then doing IO on blocks in sorted order.  It'd
take work to write an app that does this reliably, sure.

In this use the application doesn't actually care what the absolute
numbers are.  It cares about their ordering.  File systems would be
able to choose whatever scheme they wanted for the actual values of the
results from a FIBMAP-alike as long as the sorting resulted in the
right IO patterns.

Arguing that this use is significant enough to justify an addition to
the file system API is a stretch.  I'm just sharing the observation.

- z
Re: [patch 0/6][RFC] Cleanup FIBMAP
>> And another of my pet peeves with ->bmap is that it uses 0 to mean
>> "sparse" which causes a conflict on NTFS at least as block zero is
>> part of the $Boot system file so it is a real, valid block...  NTFS
>> uses -1 to denote sparse blocks internally.
>
> Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if
> there was a way to indicate your-data-is-here-but-isn't-alone.  But
> that's more of a feature for the FIEMAP stuff.

And maybe we can step back and see what the callers of FIBMAP are doing
with the results they're getting.

One use is to discover the order in which to read file data that will
result in efficient IO.  If we had an interface specifically for this
use case then perhaps a sparse block would be better reported as the
position of the inode relative to other data blocks.  Maybe the inode
block number in ext* land.

This use is interesting to me because it can be useful for any file
system, particularly networked file systems.

- z
Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]
Roland Dreier wrote:
> > > +static inline void *ERR_CAST(const void *ptr)
> > > +{
> > > +	return (void *) ptr;
> > > +}
> >
> > Just to nit, surely you don't need the cast inside the function.  The
> > casting happens at the call site between the argument and returned
> > pointer.
>
> The way it's written you kinda do, since it takes a const void * and
> returns a plain void *.

Ah, right, I missed that.

- z
Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]
> + * ERR_CAST - Explicitly cast an error-valued pointer to another pointer type
> + * @ptr: The pointer to cast.
> + *
> + * Explicitly cast an error-valued pointer to another pointer type in such a
> + * way as to make it clear that's what's going on.
> + */
> +static inline void *ERR_CAST(const void *ptr)
> +{
> +	return (void *) ptr;
> +}

Just to nit, surely you don't need the cast inside the function.  The
casting happens at the call site between the argument and returned
pointer.

- z
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
> +	return ERR_PTR(PTR_ERR(inode));

I tend to prefer the latter.

It seems like a pretty noisy way to get a (void *) cast :/.  Maybe a
function that has the cast but makes sure it's only used for IS_ERR()
pointers?

/* haha, continuing the fine tradition of terrible names in this api.. */
static inline void *PTR_PTR(void *err_ptr)
{
	BUG_ON(!IS_ERR(err_ptr) || !err_ptr);
	return err_ptr;
}

Meh.

- z
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
If you're soliciting opinions, I think I tend to prefer the feel of the
code paths after the changes.  I don't know if the benefits of the
change are worth the risk in unmaintained file systems, though.

> +	return ERR_PTR(PTR_ERR(inode));

This caught my eye.  Surely we can do better :).  It seems to happen a
few times in the patches; the instance in this patch was the first that
I noticed.

- z
Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode
> I fear the consequences of this change :(

I love it.  In the past I've lost time by working with patches which
didn't quite realize that ext3 holds a transaction open during
->direct_IO.

> Oh well, please keep it alive, maybe beat on it a bit, resend it
> later on?

I can test the patch to make sure that it catches mistakes I've made in
the past.

Peter, do you have any interest in seeing how far we can get at
tracking lock_page()?  I'm not holding my breath, but any little bit
would probably help.

- z
Re: vm/fs meetup details
> - repair driven design, we know what it is (Val told us), but how
>   does it apply to the things we are currently working on? should we
>   do more of it?

I'm sure Chris and I could talk about the design elements in btrfs that
should aid repair if folks are interested in hearing about them.  We'd
keep the hand-waving to a minimum :).

- z
Re: [PATCH] dio: remove bogus refcounting BUG_ON
> the BUG_ON().  But unfortunately, our perf. team is able to reproduce
> the problem.

What are they doing to reproduce it?  How much setup does it take?

> Debug indicated that, the ret2 == 1 :(

That could be consistent with the theory that we're racing with the dio
struct being freed and reused before it's tested in the BUG_ON()
condition.

Suparna's suggestion to sample dio->is_async before releasing the
refcount and using that in the BUG_ON() condition is a good one.

- z
[PATCH] dio: remove bogus refcounting BUG_ON
Linus, Andrew, please apply the bug fix patch at the end of this reply
for .22.

> >> One of our perf. team ran into this while doing some runs.
> >> I didn't see anything obvious - it looks like we converted
> >> async IO to synchronous one.  I didn't spend much time digging
> >> around.

OK, I think this BUG_ON() is just broken.  I wasn't able to find any
obvious bugs from reading the code which would cause the BUG_ON() to
fire.  If it's reproducible I'd love to hear what the recipe is.

I did notice that this BUG_ON() is evaluating dio after having dropped
its ref :/.  So it's not completely absurd to fear that it's a race
with the dio's memory being reused, but that'd be a pretty tight race.

Let's remove this stupid BUG_ON and see if that test box still has
trouble.  It might just hit the valid BUG_ON a few lines down, but this
unsafe BUG_ON needs to go.

---

dio: remove bogus refcounting BUG_ON

Badari Pulavarty reported a case of this BUG_ON triggering during
testing.  It's completely bogus and should be removed.

It's trying to notice if we left references to the dio hanging around
in the sync case.  They should have been dropped as IO completed while
this path was in dio_await_completion().  This condition will also be
checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a
few lines lower.  So to start this BUG_ON() is redundant.

More fatally, it's dereferencing dio->is_async after having dropped its
reference.  It's only safe to dereference the dio after releasing the
lock if the final reference was just dropped.  Another CPU might free
the dio in bio completion and reuse the memory after this path drops
the dio lock but before the BUG_ON() is evaluated.

This patch passed aio+dio regression unit tests and aio-stress on ext3.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Cc: Badari Pulavarty <[EMAIL PROTECTED]>

diff -r 509ce354ae1b fs/direct-io.c
--- a/fs/direct-io.c	Sun Jul 01 22:00:49 2007 +
+++ b/fs/direct-io.c	Tue Jul 03 14:56:41 2007 -0700
@@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	ret2 = --dio->refcount;
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
-	BUG_ON(!dio->is_async && ret2 != 0);
+
 	if (ret2 == 0) {
 		ret = dio_complete(dio, offset, ret);
 		kfree(dio);
Re: DIO panic on 2.6.21.5
On Jun 27, 2007, at 8:01 PM, Badari Pulavarty wrote:

> Hi Zach,
>
> One of our perf. team ran into this while doing some runs.
> I didn't see anything obvious - it looks like we converted
> async IO to synchronous one.  I didn't spend much time digging
> around.

It looks pretty bad, a *shouldn't happen* kind of case.  I'm sure it
just means I missed some case :/.

> Is this a known issue ?  Any ideas ?

I haven't seen it before, no.  Do they have a reliable recipe to
reproduce it?

In any case, I'll dig in early next week when I'm back from OLS.  I'm
pretty confident that we'll be able to find the case.  Please let me
know if it's urgent and I should try and find time before then.

Thanks for reporting it.

- z
Re: vm/fs meetup in september?
> > I'd just like to take the chance also to ask about a VM/FS meetup
> > some time around kernel summit (maybe take a bit of time during
> > UKUUG or so).

Yeah, I'd be interested.

> More issues:

- chris mason's patches to normalize buffered and direct locking

- z
Re: [PATCH] eCryptfs: Delay writing 0's after llseek until write
> FWIW, I believe Andrew's point was that critical information for Joe
> Enduser (and Joe Patch-Ho) was lacking in the original changelog.

and don't forget Joe eCryptfs-Maintainer-2-Years-In-The-Future.

- z
Re: [patch 1/3] fs: add an iovec iterator
> What I have there is not actually a full-blown file io descriptor,
> because there is no file or offset.  It is just an iovec iterator (so
> maybe I should rename it to iov_iter, rather than iodesc).  I think
> it might be a nice idea to keep this iov_iter as a standalone
> structure, and it could be embedded into a struct file_io?

Yeah, maybe.  I'm not sure we need something as generic as a "file_io"
struct.

To recap, I've hoped for the expression of the state of an iovec array
with a richer structure to avoid the multiple walks of the array at
different parts of the kernel that previously only had access to the
raw iovec * and size_t count.  Stuff like the alignment checks in
__blockdev_direct_IO() and the pages_in_io accounting in
direct_io_worker().

I imagined building up the state in this 'iodesc' structure as we first
copied and verified the structure from userspace (say in
rw_copy_check_uvector()).  If as we copied we, say, or'd in the bits of
the buffer and length fields, then by the time we got to
__blockdev_direct_IO() we'd just test the bits for misalignment instead
of iterating over the array again.

It starts to get gross as some paths currently modify the kernel copy
of the array as they process it :/.

- z
Re: [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
> The test case Linus sent me boils down to this:
>
> 	fd = open(file)
> 	buffer = mmap(fd, 128 pages);
> 	close(fd);
> 	fd = open(file, O_DIRECT);
> 	write(fd, buffer, 66 pages);

Yeah, though I bet the inner close/open isn't needed.

> I think the deadlock is limited to cases where get_user_pages will
> get stuck in filemap_nopage waiting for placeholders inserted by this
> DIO.

Yeah.

> It looks like that can only happen when buffer is mapped at the start
> of the dio.

At the *start* of the dio or by the time get_user_pages() is called?
The dio and mmap() aren't serialized, are they?  mmap() just sets up
the vma, I thought, and will only touch the mmap_sem.

I'm fearing threads racing write(fd, buffer, ) and mmap(buffer,
MAP_FIXED...).  I might just be missing the locking that serializes
them.  If nothing else, this should be mentioned in the comment above
the code that looks like a racy test against the presence of a mapping.

- z
Re: [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
> +static void dio_unlock_page_range(struct dio *dio)
> +{
> +	if (dio->lock_type != DIO_NO_LOCKING) {
> +		remove_placeholder_pages(dio->inode->i_mapping,
> +					 dio->fspages_start_off,
> +					 dio->fspages_end_off);
> +		dio->fspages_end_off = dio->fspages_start_off;

I wonder if it should call remove_placeholder_pages when _end_off !=
_start_off.  That feels like a more direct test.

> +	max_size = (dio->fspages_end_off - index) << PAGE_CACHE_SHIFT;
> +	if (map_bh->b_size > max_size)
> +		map_bh->b_size = max_size;

b_size = min(b_size, max_size)?

> +	/* if the mapping is mapped, they may be using a mmap'd portion
> +	 * of the file as the buffer for this io.  That will deadlock
> +	 * with placeholders because the placeholder code forces the
> +	 * page fault handler to block.  The (ugly) solution is to
> +	 * limit the span of inserted placeholders to the same
> +	 * increment we use for get_user_pages.
> +	 */
> +	if (inode->i_mapping->nrpages ||
> +	    mapping_mapped(inode->i_mapping))
> +		dio->fspages_span = DIO_PAGES;
> +	else
> +		dio->fspages_span = ULONG_MAX;

Couldn't the decision to use DIO_PAGES in the presence of a mapping
race with establishing a mapping?  I fear we have to use DIO_PAGES all
the time, which at least has the benefit of getting rid of the _span
member of dio :).

- z
Re: [PATCH 1 of 8] Introduce a place holder page for the pagecache
> Placeholders can span a range bigger than one page.  The placeholder
> is inserted into the radix slot for the end of the range, and the
> flags field in the page struct is used to record the start of the
> range.

Not surprisingly, I like how this makes the interaction between
buffered and O_DIRECT a lot less magical.  Nice work.

Some comments:

> +void wait_on_placeholder_pages_range(struct address_space *mapping,
> +				     pgoff_t start, pgoff_t end)
...
> +	while (start <= end) {
...
> +			start = pages[i]->index + 1;
> +			if (pages[i]->index > end)
> +				goto done;

It looks like this might be confused by an end and final page->index of
~0?  Maybe it'd be safer to explicitly deal with this here instead of
relying on safe arguments.

> +		}
> +		if (need_resched()) {
> +			read_unlock_irq(&mapping->tree_lock);
> +			cond_resched();
> +			read_lock_irq(&mapping->tree_lock);
> +		}

We should introduce a cond_resched_lock_irq() to match
cond_resched_lock().

> +void remove_placeholder_pages(struct address_space *mapping,
> +			      unsigned long start,
> +			      unsigned long end)
> +{
> +	struct page *page;
> +	int ret;
> +	int i;
> +	unsigned long num;
> +	struct page *pages[8];
> +
> +	write_lock_irq(&mapping->tree_lock);
> +	while (start < end) {
...
> +	write_unlock_irq(&mapping->tree_lock);

And call that cond_resched_lock_irq() in here too?  Since we're
grabbing the lock in here, presumably the caller won't mind if we grab
and release it a few times?

> +static int insert_placeholder(struct address_space *mapping,
> +			      struct page *insert)
> +{

spin_lock_assert(&mapping->tree_lock)?

> +int find_or_insert_placeholders(struct address_space *mapping,
> +				unsigned long start, unsigned long end,
> +				gfp_t gfp_mask, int iowait)
> +{
...
> +	/*
> +	 * this gets complicated.

It sure does!  Sprinkling a few comments above the main blocks in the
loop might be nice.

> +			err = filemap_write_and_wait_range(mapping,
> +					cur << PAGE_CACHE_SHIFT,
> +					end << PAGE_CACHE_SHIFT);

This end is exclusive (>= end breaks) but filemap_write_and_wait_range()
-> wait_on_page_writeback_range() wants an argument that is inclusive
(> end breaks)?

> +		if (PagePlaceHolder(page)) {
> +			DEFINE_WAIT(wait);
> +			wait_queue_head_t *wqh = page_waitqueue(page);
> +			prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +			read_unlock_irq(&mapping->tree_lock);
> +			io_schedule();
> +			finish_wait(wqh, &wait);
> +			read_lock_irq(&mapping->tree_lock);

This pattern seems to be repeated quite a few times.  Maybe a helper
would.. help?  Maybe it's not worth it if it has to do silly little
pre-schedule tasks differently at each site.

> -	for (i = 0; i < ret; i++)
> -		page_cache_get(pages[i]);
> +	while (i < ret) {
> +		if (PagePlaceHolder(pages[i])) {
> +			/* we can't return a place holder, shift it away */
> +			if (i + 1 < ret) {
> +				memcpy(&pages[i], &pages[i+1],
> +				       (ret - i - 1) * sizeof(struct page *));
> +			}
> +			ret--;
> +			continue;
> +		} else
> +			page_cache_get(pages[i]);
> +		i++;

It looks like this loop is implemented twice?  How about factoring it
out?  Especially with the risk of off-by-ones in there.  (Though I
don't see problems, it looks correct.)

More on the rest of the series after lunch :).

- z
Re: [RFC] Heads up on a series of AIO patchsets
> generic_write_checks() are done in the submission path, not repeated
> during retries, so such types of checks are not intended to run in
> the aio thread.

Ah, I see, I was missing the short-cut which avoids re-running parts of
the write path if we're in a retry.

	if (!is_sync_kiocb(iocb) && kiocbIsRestarted(iocb)) {
		/* nothing to transfer, may just need to sync data */
		return ocount;

It's pretty subtle that this has to be placed before the first
significant current reference and that nothing in the path can return
-EIOCBRETRY until after all of the significant current references.

In totally unrelated news, I noticed that current->io_wait is set to
NULL instead of &current->__wait after having run the iocb.  I wonder
if it shouldn't be saved and restored instead.  And maybe update the
comment over is_sync_wait()?  Just an observation.

> That is great and I look forward to it :)  I am, however assuming
> that whatever implementation you come up will have a different
> interface from current linux aio -- i.e. a next generation aio model,
> that will be easily integratable with kevents etc.

Yeah, that's the hope.

> Which takes me back to Ingo's point - lets have the new evolve
> parallely with the old, if we can, and not hold up the patches for
> POSIX AIO to start using kernel AIO, or for epoll to integrate with
> AIO.

Sure, though there are only so many hours in a day :).

> OK, I just took a quick look at your blog and I see that you are
> basically implementing Linus' microthreads scheduling approach - one
> year since we had that discussion.

Yeah.  I wanted to see what it would look like.

> Glad to see that you found a way to make it workable ...

Well, that remains to be seen.  If nothing else we'll at least have
code to point at when discussing it.  If we all agree it's not the
right way and dismiss the notion, fine, that's progress :).

> (I'm guessing that you are copying over the part of the stack in use
> at the time of every switch, is that correct ?

That was my first pass, yeah.  It turned the knob a little too far
towards the "invasive but efficient" direction for my taste.  I'm now
giving it a try by having full stacks for each blocked op, we'll see
how that goes.

> At what point do you do the allocation of the saved stacks ?

I was allocating at block-time to keep memory consumption down, but I
think my fiddling around with it convinced me that isn't workable.

- z
Re: [RFC] Heads up on a series of AIO patchsets
Sorry for the delay, I'm finally back from the holiday break :)

> (1) The filesystem AIO patchset, attempts to address one part of the
> problem which is to make regular file IO, (without O_DIRECT)
> asynchronous (mainly the case of reads of uncached or partially
> cached files, and O_SYNC writes).

One of the properties of the currently implemented EIOCBRETRY aio path
is that ->mm is the only field in current which matches the submitting
task_struct while inside the retry path.

It looks like a retry-based aio write path would be broken because of
this.  generic_write_checks() could run in the aio thread and get its
task_struct instead of that of the submitter.  The wrong rlimit will
be tested and SIGXFSZ won't be raised.  remove_suid() could check the
capabilities of the aio thread instead of those of the submitter.

I don't think EIOCBRETRY is the way to go because of this increased
(and subtle!) complexity.  What are the chances that we would have
ever found those bugs outside code review?  How do we make sure that
current references don't sneak back in after having initially audited
the paths?

Take the io_cmd_epoll_wait patch..

> issues).  The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown
> with modifications from Jeff Moyer and me) addresses this problem for
> native linux aio in a simple manner.

It's simple looking, sure.  This current flipping didn't even occur to
me while throwing the patch together!

But that patch ends up calling ->poll (and poll_table->qproc) and
writing to userspace (so potentially calling ->nopage) from the aio
threads.  Are we sure that none of them will behave surprisingly
because current changed under them?

It might be safe now, but that isn't really the point.  I'd rather we
didn't have yet one more subtle invariant to audit and maintain.

At the risk of making myself vulnerable to the charge of mentioning
vapourware, I will admit that I've been working on a (slightly mad)
implementation of async syscalls.  I've been quiet about it because I
don't want to whip up complicated discussion without being able to
show code that works, even if barely.  I mention it now only to make
it clear that I want to be constructive, not just critical :).

- z
Re: [EMAIL PROTECTED]: Re: [Linux-cluster] Re: [PATCH 1/3] dlm: use configfs]
> Currently we don't support buffered aio on any filesystem in
> mainline, so adding crufty code to mainline sounds like a bad idea.
> Zab agreed on that and wants to remove it as much as it gets.

Yeah, we aim to simplify this code.  For the record, it wasn't buffered
aio that was the problem.  There were two naughty moving parts:

First, trying not to block in the dlm when issuing aio ops and tracking
state to restart after dlm ops returned -EIOCBQUEUED.  This was just
overly aggressive.  This can behave like block mapping lookups in that
it rarely blocks.  Most aio that people care about (direct io writes to
already allocated regions) will simply be acquiring and releasing
shared-read locks around each op -- trivial local operations.

Second, trying to hold dlm locks around the entirety of aio ops.  This
led to the mess of trying to tear down locks in the iocb dtor method.
(Which can then race with unmount; aio does __fput on the filp,
dropping the vfsmount ref, before calling the dtor.. bleh.)  We can get
around this by unlocking after performing the block mapping lookups and
issuing the io, and introducing a cluster DLM lock which behaves like
i_alloc_sem.

So, how about a patch that lets the fs provide a callback to
acquire/release i_alloc_sem at the current sites (dio, notify_change)
that work with it?  Most file systems wouldn't provide a callback and
the code would just use the sem as usual, but clustered guys could use
dlm locking.

- z
Re: [PATCH] configfs, a filesystem for userspace-driven kernel object configuration
Matt Mackall wrote:
> On Sun, Apr 03, 2005 at 12:57:28PM -0700, Joel Becker wrote:
>
>> 	An interface in /proc where the API is:
>>
>> or an ioctl(2) interface where the API is:
>>
>> becomes this in configfs:
>>
>> 	# cd /config/mythingy
>> 	# mkdir foo
>> 	# echo 1 > foo/index
>> 	# echo 3 > foo/count
>> 	# echo 0x00013 > foo/address
>>
>> 	Instead of a binary blob that's passed around or a cryptic
>> string that has to be formatted just so, configfs provides an
>> interface that's completely scriptable and navigable.
>
> How does the kernel know when to actually create the object?

"actually create", huh? :)

In the trivial case Joel describes, the item is almost certainly
allocated during "# mkdir foo" when the subsystem will get a
->make_item() call for the 'mythingy' group it registered.  The various
attribute writes then find the item by following their
configfs_attribute argument to the item that it's embedded in.

But I bet you're not really asking about creation.  I bet you're
wondering how the kernel will know when enough attributes have been
filled and that it's safe to use the object.

Misguided items could assign magical ordering to the attribute filling
such that once a final attribute is set, and others have been set, the
item goes live.  That's what ocfs2 does now, sadly, but certainly not
as a long-term solution.

The missing piece is the 'commit_item' group operation that is yet to
be implemented.  The intent is to have a directory of pending items
that can have their attributes filled before being rename()ed into a
directory of items that are in active use.  The commit_item() call that
hits at rename() would give the kernel the chance to refuse the item
because attributes haven't been filled in or conflict with existing
items, or whatever.

- z
Re: [PATCH] configfs, a filesystem for userspace-driven kernel object configuration
Arjan van de Ven wrote:
> On Sun, 2005-04-03 at 12:57 -0700, Joel Becker wrote:
>
>> Folks,
>> 	I humbly submit configfs.  With configfs, a configfs
>> config_item is created via an explicit userspace operation: mkdir(2).
>> It is destroyed via rmdir(2).  The attributes appear at mkdir(2)
>> time, and can be read or modified via read(2) and write(2).
>> readdir(3) queries the list of items and/or attributes.
>> 	The lifetime of the filesystem representation is completely
>> driven by userspace.  The lifetime of the objects themselves are
>> managed by a kref, but at rmdir(2) time they disappear from the
>> filesystem.
>
> does that mean you rmdir a non-empty directory ??

Yeah, but only attributes and default groups are automatically torn
down.  You can't rmdir() an item that is the destination of links and
you can't rmdir() groups that still contain items.

- z
Re: Address space operations questions
Bryan Henderson wrote:
>> So, semantics of ->sync_page() are roughly "kick underlying storage
>> driver to actually perform all IO queued for this page, and, maybe,
>> for other pages on this device too".
>
> I prefer to think of it in a more modular sense.  To preserve
> modularity, the caller of sync_page() can't know anything about I/O
> scheduling.  So I think the semantics of ->sync_page() are "Someone
> is about to wait for the results of the previously requested
> write_page on this page."  It's completely up to the owner of the
> address space to figure out what would be appropriate to do given
> that information.

Though I agree with your desire for a "modular" interpretation, I'm
going to disagree with your description of the information sync_page()
provides.  Nikita's vague description is closer to the truth because
sync_page() is vague.

If you follow the callers of sync_page() you quickly find that what it
*really* means to be called in sync_page() is that you're being told
that some process is about to block on that page.  For what reason,
you can't know from the call alone.  Waiting for read to complete and
unlock?  Waiting for writeback to clear the writeback bit?  Some
processes just happened to race to lock_page() on the same page for
reasons that have nothing to do with IO?

And the not-so-initiated might think that sync_page() is called with
the page lock held just like the other ops.  That's obviously not the
case, as it is called when the page lock can't be acquired, but it
makes actually *doing* something with that page argument perilous.
Synchronization is left to the sync_page() callee.

For example, cachefs seems to be among the very few who don't provide
block_sync_page() directly.  All it does is printk() and then call
block_sync_page().  In that single line of code it derefs
page->mapping without checking and so can probably fatally race with
truncation and the nulling of page->mapping.

- z
Re: Efficient handling of sparse files
> Please keep one thing in mind and that is that there are file systems
> where ->bmap actually makes no sense whatsoever

Of course, so return -ESORRY.

> This is one of the reasons why noone should be using ->bmap.  It is a
> stupid interface that only fits very particular sets of file systems
> and cannot be applied generically.

No, it's a reason to only ask about the details of block mapping in
cases where it actually makes sense (like wanting to find out if
concurrent file extension is getting good batched contiguous
allocation, etc).  Just because file systems x, y, and z can't answer
the question meaningfully doesn't mean it isn't a reasonable thing to
ask of file systems m, n, and o.

Now, I'm not at all opposed to an explicit sparse-testing interface
that doesn't confuse that functionality with querying specific block
mappings.

- z
Re: Efficient handling of sparse files
> I was wondering if we could introduce a new system call (or ioctl?)
> that, given an fd would find the next block with data in it.  We
> could use the ->bmap method ... except that has dire warnings about
> adding new callers and viro may soon be in testicle-gouging range.

Hmm.  What you're talking about reminds me of some ioctl()s Alex has
for ext3+extents and it feels like the pagevec apis that want to find
populated pages across the page cache index space.

Sooo.

	struct fs_extent {
		u64 file_start;
		u64 block_start;
		u64 contig;
	};

(I don't really care if those are in bytes or blocks or whatever.
Someone with strong opinions can pick a unit :))

	long sys_find_extents_please(int fd, off_t file_start,
				     struct fs_extent *extents,
				     long nr_extents);

so it'll fill in as many extent structs in the caller as it finds
contiguous regions on disk starting with the given file position,
returning the number populated.  I'd, somewhat obviously, want to push
this into an fs method, perhaps with a generic_ helper that just spins
on bmap().

I think this would let Alex kill his ioctl() and ocfs2 could certainly
fill this with reasonable results.

To move into lala land, I wonder if we would want to consider the
difference between mapped blocks with data and mapped blocks that
haven't been touched and which are going to return zeros.  One could
argue that it's marginally ridiculous that there isn't a shared
interface to reserve blocks without having to manually zero them.  If
we did have such an interface, something like rsync doesn't actually
care that they're mapped if the fs knows that they're still just
zeros.  I acknowledge that this is a bit out there :)

In any case, if you want help rolling proofs-of-concept I could lend a
few hours.

- z