
Re: [PATCH][RFC] fast file mapping for loop

2008-01-15 Thread Chris Mason

On Tue, 15 Jan 2008 11:07:40 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:

> > > I split and merged the patch into five bits (added ext3 support),
> > > so perhaps that would be easier for people to read/review.
> > > Attached, and also available in the loop-extent_map branch here:

Thanks!

> > > 
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map
> > 
> > Seems my ext3 version doesn't work, it craps out in
> > ext3_get_blocks_handle() triggering this bug:
> > 
> > J_ASSERT(handle != NULL || create == 0);
> > 
> > I'll see if I can fix that, being fairly fs ignorant...
> 
> This works, but probably pretty suboptimal (should end the new journal
> in map_io_complete()?). And yes I know the >> 9 isn't correct, since
> the fs block size is larger. Just making sure that we always have
> enough blocks.

You can use DIO_CREDITS instead of len >> 9, just like the ext3
O_DIRECT code does.  Your current patch is fine, except it breaks
data=ordered rules.  My plan to work within data=ordered:

1) Inside ext3_map_extent (while the transaction was running), increment
a counter in the ext3 journal for number of pending IOs.  Then end the
transaction handle.

2) Drop this counter inside the IO completion call

3) Change the ext3 commit code to wait for the IO count to be zero.

I'll give it a shot later this week, until then your current patch is
just data=writeback, which is good enough for testing.
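
For illustration, a minimal userspace sketch of the three steps above. It
assumes C11 atomics stand in for whatever counter the journal would carry;
the names (journal_sketch, map_extent_begin_io, io_complete,
commit_may_proceed) are hypothetical and are not ext3 or JBD functions. In
the kernel the commit path would sleep until the count drops to zero; the
sketch only exposes the predicate it would wait on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-journal state; not a JBD structure. */
struct journal_sketch {
	atomic_uint pending_io;	/* data IOs mapped but not yet completed */
};

static void journal_sketch_init(struct journal_sketch *j)
{
	atomic_init(&j->pending_io, 0);
}

/* Step 1: called while the transaction handle is still open. */
static void map_extent_begin_io(struct journal_sketch *j)
{
	atomic_fetch_add(&j->pending_io, 1);
}

/* Step 2: called from the IO completion path. */
static void io_complete(struct journal_sketch *j)
{
	unsigned int old = atomic_fetch_sub(&j->pending_io, 1);

	assert(old > 0);	/* a completion without a matching start is a bug */
}

/* Step 3: the commit code may only proceed once no counted IO is in flight. */
static bool commit_may_proceed(struct journal_sketch *j)
{
	return atomic_load(&j->pending_io) == 0;
}
```

Because the counter is raised before the handle is ended, a commit cannot
slip in between mapping the blocks and submitting the data IO.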

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH][RFC] fast file mapping for loop

2008-01-15 Thread Jens Axboe

On Tue, Jan 15 2008, Jens Axboe wrote:
> On Tue, Jan 15 2008, Jens Axboe wrote:
> > On Mon, Jan 14 2008, Jens Axboe wrote:
> > > On Mon, Jan 14 2008, Chris Mason wrote:
> > > > Hello everyone,
> > > > 
> > > > Here is a modified version of Jens' patch.  The basic idea is to push
> > > > the mapping maintenance out of loop and down into the filesystem (ext2
> > > > in this case).
> > > > 
> > > > Two new address_space operations are added, one to map
> > > > extents and the other to provide call backs into the FS as io is
> > > > completed.
> > > > 
> > > > Still TODO for this patch:
> > > > 
> > > > * Add exclusion between filling holes and readers.  This is partly
> > > > implemented, when a hole is filled by the FS, the extent is flagged as
> > > > having a hole.  The idea is to check this flag before trying to read
> > > > the blocks and just send back all zeros.
> > > > 
> > > > The flag would be cleared when the blocks filling the hole have been
> > > > written.
> > > > 
> > > > * Exclude page cache readers and writers
> > > > 
> > > > * Add a way for the FS to request a commit before telling the higher
> > > > layers the IO is complete.  This way we can make sure metadata related
> > > > to holes is on disk before claiming the IO is really done.  COW based
> > > > > filesystems will also need it.
> > > > 
> > > > * Change loop to use fast mapping only when the new address_space op is
> > > > provided (whoops, forgot about this until just now)
> > > > 
> > > > * A few other bits for COW, not really relevant until there
> > > > is...something COW using it.
> > > 
> > > Looks pretty good. Essentially the loop side is 100% the same, it just
> > > offloads the extent ownership to the fs (where it belongs). So I like
> > > it. Attaching a small cleanup/fixup patch for loop, don't think it needs
> > > further explanations.
> > > 
> > > One suggestion - free_extent_map(), I would call that put_extent_map()
> > > instead.
> > 
> > I split and merged the patch into five bits (added ext3 support), so
> > perhaps that would be easier for people to read/review. Attached, and
> > also available in the loop-extent_map branch here:
> > 
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map
> 
> Seems my ext3 version doesn't work, it craps out in
> ext3_get_blocks_handle() triggering this bug:
> 
> J_ASSERT(handle != NULL || create == 0);
> 
> I'll see if I can fix that, being fairly fs ignorant...

This works, but probably pretty suboptimal (should end the new journal
in map_io_complete()?). And yes I know the >> 9 isn't correct, since the
fs block size is larger. Just making sure that we always have enough
blocks.
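
To make the unit mismatch concrete, here is a small sketch. It only shows
how far len >> 9 (a count of 512-byte sectors) is from the number of
filesystem blocks; it says nothing about how many journal credits ext3
really needs, which also depends on metadata. The helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 512-byte sectors covered by len bytes: what len >> 9 computes. */
static uint64_t sectors_for_len(uint64_t len)
{
	return len >> 9;
}

/*
 * Number of filesystem blocks covered by len bytes, rounded up.
 * blkbits is log2 of the fs block size, e.g. 12 for 4096-byte blocks.
 * Assumes the range starts on a block boundary.
 */
static uint64_t fs_blocks_for_len(uint64_t len, unsigned int blkbits)
{
	assert(blkbits >= 9 && blkbits < 64);
	return (len + (1ULL << blkbits) - 1) >> blkbits;
}
```

For a 64 KiB request on a filesystem with 4096-byte blocks, the sector
count is 128 while the block count is 16, so the shift by 9 overshoots by a
factor of eight rather than coming up short.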

Punting to Chris!

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 55e677d..e97181a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1002,11 +1002,25 @@ static struct extent_map *ext3_map_extent(struct address_space *mapping,
  gfp_t gfp_mask)
 {
struct extent_map_tree *tree = &EXT3_I(mapping->host)->extent_tree;
+   handle_t *handle = NULL;
+   struct extent_map *ret;
 
-   return map_extent_get_block(tree, mapping, start, len, create, gfp_mask,
+   if (create) {
+   handle = ext3_journal_start(mapping->host, len >> 9);
+   if (IS_ERR(handle))
+   return (struct extent_map *) handle;
+   }
+
+   ret = map_extent_get_block(tree, mapping, start, len, create, gfp_mask,
ext3_get_block);
+   if (handle)
+   ext3_journal_stop(handle);
+
+   return ret;
 }
 
+
+
 /*
  * `handle' can be NULL if create is zero
  */

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-15 Thread Jens Axboe

On Tue, Jan 15 2008, Jens Axboe wrote:
> On Mon, Jan 14 2008, Jens Axboe wrote:
> > On Mon, Jan 14 2008, Chris Mason wrote:
> > > Hello everyone,
> > > 
> > > Here is a modified version of Jens' patch.  The basic idea is to push
> > > the mapping maintenance out of loop and down into the filesystem (ext2
> > > in this case).
> > > 
> > > Two new address_space operations are added, one to map
> > > extents and the other to provide call backs into the FS as io is
> > > completed.
> > > 
> > > Still TODO for this patch:
> > > 
> > > * Add exclusion between filling holes and readers.  This is partly
> > > implemented, when a hole is filled by the FS, the extent is flagged as
> > > having a hole.  The idea is to check this flag before trying to read
> > > the blocks and just send back all zeros.
> > > 
> > > The flag would be cleared when the blocks filling the hole have been
> > > written.
> > > 
> > > * Exclude page cache readers and writers
> > > 
> > > * Add a way for the FS to request a commit before telling the higher
> > > layers the IO is complete.  This way we can make sure metadata related
> > > to holes is on disk before claiming the IO is really done.  COW based
> > > > filesystems will also need it.
> > > 
> > > * Change loop to use fast mapping only when the new address_space op is
> > > provided (whoops, forgot about this until just now)
> > > 
> > > * A few other bits for COW, not really relevant until there
> > > is...something COW using it.
> > 
> > Looks pretty good. Essentially the loop side is 100% the same, it just
> > offloads the extent ownership to the fs (where it belongs). So I like
> > it. Attaching a small cleanup/fixup patch for loop, don't think it needs
> > further explanations.
> > 
> > One suggestion - free_extent_map(), I would call that put_extent_map()
> > instead.
> 
> I split and merged the patch into five bits (added ext3 support), so
> perhaps that would be easier for people to read/review. Attached, and
> also available in the loop-extent_map branch here:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map

Seems my ext3 version doesn't work, it craps out in
ext3_get_blocks_handle() triggering this bug:

J_ASSERT(handle != NULL || create == 0);

I'll see if I can fix that, being fairly fs ignorant...

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-15 Thread Jens Axboe

On Mon, Jan 14 2008, Jens Axboe wrote:
> On Mon, Jan 14 2008, Chris Mason wrote:
> > Hello everyone,
> > 
> > Here is a modified version of Jens' patch.  The basic idea is to push
> > the mapping maintenance out of loop and down into the filesystem (ext2
> > in this case).
> > 
> > Two new address_space operations are added, one to map
> > extents and the other to provide call backs into the FS as io is
> > completed.
> > 
> > Still TODO for this patch:
> > 
> > * Add exclusion between filling holes and readers.  This is partly
> > implemented, when a hole is filled by the FS, the extent is flagged as
> > having a hole.  The idea is to check this flag before trying to read
> > the blocks and just send back all zeros.
> > 
> > The flag would be cleared when the blocks filling the hole have been
> > written.
> > 
> > * Exclude page cache readers and writers
> > 
> > * Add a way for the FS to request a commit before telling the higher
> > layers the IO is complete.  This way we can make sure metadata related
> > to holes is on disk before claiming the IO is really done.  COW based
> > filesystems will also need it.
> > 
> > * Change loop to use fast mapping only when the new address_space op is
> > provided (whoops, forgot about this until just now)
> > 
> > * A few other bits for COW, not really relevant until there
> > is...something COW using it.
> 
> Looks pretty good. Essentially the loop side is 100% the same, it just
> offloads the extent ownership to the fs (where it belongs). So I like
> it. Attaching a small cleanup/fixup patch for loop, don't think it needs
> further explanations.
> 
> One suggestion - free_extent_map(), I would call that put_extent_map()
> instead.

I split and merged the patch into five bits (added ext3 support), so
perhaps that would be easier for people to read/review. Attached, and
also available in the loop-extent_map branch here:

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map

-- 
Jens Axboe

From b179b114b7ef6f4df1520e0013b8efe631ff577d Mon Sep 17 00:00:00 2001
From: Chris Mason <[EMAIL PROTECTED]>
Date: Tue, 15 Jan 2008 10:04:43 +0100
Subject: [PATCH] fs: add extent_map library

Signed-off-by: Jens Axboe <[EMAIL PROTECTED]>
---
 fs/Makefile                |    2 +-
 fs/dcache.c                |    2 +
 fs/extent_map.c            |  402
 include/linux/extent_map.h |   58 +++
 4 files changed, 463 insertions(+), 1 deletions(-)
 create mode 100644 fs/extent_map.c
 create mode 100644 include/linux/extent_map.h

diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..01b37aa 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o
+		stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index d9ca1e5..edd9faa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 
@@ -2170,6 +2171,7 @@ void __init vfs_caches_init(unsigned long mempages)
 
 	dcache_init();
 	inode_init();
+	extent_map_init();
 	files_init(mempages);
 	mnt_init();
 	bdev_cache_init();
diff --git a/fs/extent_map.c b/fs/extent_map.c
new file mode 100644
index 000..1e9ff6e
--- /dev/null
+++ b/fs/extent_map.c
@@ -0,0 +1,402 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static struct kmem_cache *extent_map_cache;
+
+int __init extent_map_init(void)
+{
+	extent_map_cache = kmem_cache_create("extent_map",
+	sizeof(struct extent_map), 0,
+	SLAB_MEM_SPREAD, NULL);
+	if (!extent_map_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void __exit extent_map_exit(void)
+{
+	if (extent_map_cache)
+		kmem_cache_destroy(extent_map_cache);
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree)
+{
+	tree->map.rb_node = NULL;
+	tree->last = NULL;
+	rwlock_init(&tree->lock);
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+struct extent_map *alloc_extent_map(gfp_t mask)
+{
+	struct extent_map *em;
+	em = kmem_cache_alloc(extent_map_cache, mask);
+	if (!em || IS_ERR(em))
+		return em;
+	atomic_set(&em->refs, 1);
+	em->flags = 0;
+	return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+void free_extent_map(struct extent_map *em)
+{
+	if (!em)
+		return;
+	if (atomic_dec_and_test(&em->refs))
+		kmem_cache_free(extent_map_cache, em);
+}
+EXPORT_SYMBOL(free_extent_map);
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+   struct rb_node *node)
+{
+	struct rb_node ** p = &root->rb_node;
+	struct rb_node * parent = NULL;
+	struct extent_map *entry;
+
+	while(*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct extent_map, rb_node);
+
+		if (offset < entry->start)
+			p = &(*p)->rb_le

Re: [PATCH][RFC] fast file mapping for loop

2008-01-14 Thread Jens Axboe

On Mon, Jan 14 2008, Chris Mason wrote:
> Hello everyone,
> 
> Here is a modified version of Jens' patch.  The basic idea is to push
> the mapping maintenance out of loop and down into the filesystem (ext2
> in this case).
> 
> Two new address_space operations are added, one to map
> extents and the other to provide call backs into the FS as io is
> completed.
> 
> Still TODO for this patch:
> 
> * Add exclusion between filling holes and readers.  This is partly
> implemented, when a hole is filled by the FS, the extent is flagged as
> having a hole.  The idea is to check this flag before trying to read
> the blocks and just send back all zeros.
> 
> The flag would be cleared when the blocks filling the hole have been
> written.
> 
> * Exclude page cache readers and writers
> 
> * Add a way for the FS to request a commit before telling the higher
> layers the IO is complete.  This way we can make sure metadata related
> to holes is on disk before claiming the IO is really done.  COW based
> filesystems will also need it.
> 
> * Change loop to use fast mapping only when the new address_space op is
> provided (whoops, forgot about this until just now)
> 
> * A few other bits for COW, not really relevant until there
> is...something COW using it.

Looks pretty good. Essentially the loop side is 100% the same, it just
offloads the extent ownership to the fs (where it belongs). So I like
it. Attaching a small cleanup/fixup patch for loop, don't think it needs
further explanations.

One suggestion - free_extent_map(), I would call that put_extent_map()
instead.
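
The reason for the naming suggestion is that the function drops a reference
and only frees the object when the last reference goes away. A minimal
userspace sketch of that get/put pattern follows; it assumes C11 atomics in
place of the kernel's atomic_t, and every name ending in _sketch is
hypothetical rather than part of the posted patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Simplified stand-in for struct extent_map; only the refcount matters here. */
struct extent_map_sketch {
	atomic_int refs;
	unsigned long long start;
	unsigned long long len;
};

/* A freshly allocated map starts with one reference, owned by the caller. */
static struct extent_map_sketch *alloc_extent_map_sketch(void)
{
	struct extent_map_sketch *em = calloc(1, sizeof(*em));

	if (!em)
		return NULL;
	atomic_init(&em->refs, 1);
	return em;
}

/* Take an additional reference. */
static void get_extent_map_sketch(struct extent_map_sketch *em)
{
	atomic_fetch_add(&em->refs, 1);
}

/*
 * Drop one reference; free the map when the last one is gone.
 * Returns 1 if the map was freed, 0 otherwise (including for NULL).
 */
static int put_extent_map_sketch(struct extent_map_sketch *em)
{
	int old;

	if (!em)
		return 0;
	old = atomic_fetch_sub(&em->refs, 1);
	assert(old > 0);
	if (old == 1) {
		free(em);
		return 1;
	}
	return 0;
}
```

The return value exists only so the behaviour is easy to check; the
function in the patch returns void.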

diff -u b/drivers/block/loop.c b/drivers/block/loop.c
--- b/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -677,13 +677,14 @@
if (IS_ERR(lfe))
return -EIO;
 
-   while(!lfe) {
+   while (!lfe) {
loop_schedule_extent_mapping(lo, bio->bi_sector,
 bio->bi_size, 1);
lfe = loop_lookup_extent(lo, start, GFP_ATOMIC);
if (IS_ERR(lfe))
return -EIO;
}
+
/*
 * handle sparse io
 */
@@ -802,13 +803,13 @@
 {
struct bio *orig_bio = bio->bi_private;
struct inode *inode = bio->bi_bdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
u64 start = orig_bio->bi_sector << 9;
u64 len = bio->bi_size;
 
-   if (inode->i_mapping->a_ops->extent_io_complete) {
-   inode->i_mapping->a_ops->extent_io_complete(inode->i_mapping,
-   start, len);
-   }
+   if (mapping->a_ops->extent_io_complete)
+   mapping->a_ops->extent_io_complete(mapping, start, len);
+
bio_put(bio);
bio_endio(orig_bio, err);
 }
@@ -829,6 +830,7 @@
err = -ENOMEM;
goto out;
}
+
/*
 * change the sector so we can find the correct file offset in our
 * endio
@@ -847,7 +849,6 @@
goto out;;
}
 
-
disk_block = em->block_start;
extent_off = start - em->start;
new_bio->bi_sector = (disk_block + extent_off) >> 9;
@@ -924,11 +925,8 @@
spin_unlock_irq(&lo->lo_lock);
 
BUG_ON(!bio);
-   if (lo_act_bio(bio))
-   bio_act = 1;
-   else
-   bio_act = 0;
 
+   bio_act = lo_act_bio(bio);
loop_handle_bio(lo, bio);
 
spin_lock_irq(&lo->lo_lock);
@@ -1103,11 +1101,9 @@
return -EINVAL;
 
/*
-* Need a working bmap. TODO: use the same optimization that
-* direct-io.c does for get_block() mapping more than one block
-* at the time.
+* Need a working extent_map
 */
-   if (inode->i_mapping->a_ops->bmap == NULL)
+   if (inode->i_mapping->a_ops->map_extent == NULL)
return -EINVAL;
/*
 * invalidate all page cache belonging to this file, it could become
diff -u b/include/linux/loop.h b/include/linux/loop.h
--- b/include/linux/loop.h
+++ b/include/linux/loop.h
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 
 /* Possible states of device */
 enum {
@@ -72,9 +71,6 @@
struct gendisk  *lo_disk;
struct list_headlo_list;
 
-   struct prio_tree_root   prio_root;
-   struct prio_tree_node   *last_insert;
-   struct prio_tree_node   *last_lookup;
unsigned intblkbits;
 };
 

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-14 Thread Chris Mason

Hello everyone,

Here is a modified version of Jens' patch.  The basic idea is to push
the mapping maintenance out of loop and down into the filesystem (ext2
in this case).

Two new address_space operations are added, one to map
extents and the other to provide call backs into the FS as io is
completed.

Still TODO for this patch:

* Add exclusion between filling holes and readers.  This is partly
implemented, when a hole is filled by the FS, the extent is flagged as
having a hole.  The idea is to check this flag before trying to read
the blocks and just send back all zeros.

The flag would be cleared when the blocks filling the hole have been
written.

* Exclude page cache readers and writers

* Add a way for the FS to request a commit before telling the higher
layers the IO is complete.  This way we can make sure metadata related
to holes is on disk before claiming the IO is really done.  COW based
filesystems will also need it.

* Change loop to use fast mapping only when the new address_space op is
provided (whoops, forgot about this until just now)

* A few other bits for COW, not really relevant until there
is...something COW using it.
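
As a rough illustration of the two operations and of the "use fast mapping
only when the op is provided" item, here is a userspace sketch. The structs
and signatures are simplified assumptions (the real map_extent call in the
patch takes more arguments, including a gfp mask), and the toy_* functions
are a made-up filesystem used only to exercise the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mapping_sketch;

struct extent_sketch {
	uint64_t start;		/* file offset in bytes */
	uint64_t len;		/* length in bytes */
	uint64_t block_start;	/* byte offset on the backing device */
};

/* Simplified ops table; either pointer may be NULL if the fs lacks the op. */
struct aops_sketch {
	int (*map_extent)(struct mapping_sketch *m, uint64_t start,
			  uint64_t len, int create, struct extent_sketch *out);
	void (*extent_io_complete)(struct mapping_sketch *m, uint64_t start,
				   uint64_t len);
};

struct mapping_sketch {
	const struct aops_sketch *a_ops;
	unsigned int completions;	/* bookkeeping for the toy fs only */
};

enum io_path { IO_PATH_PAGECACHE, IO_PATH_FASTFS };

/* Use the fast path only if the filesystem can map extents for us. */
static enum io_path choose_io_path(const struct mapping_sketch *m)
{
	if (m->a_ops && m->a_ops->map_extent)
		return IO_PATH_FASTFS;
	return IO_PATH_PAGECACHE;
}

/* Call back into the filesystem when the IO finishes, if it asked for that. */
static void io_done(struct mapping_sketch *m, uint64_t start, uint64_t len)
{
	if (m->a_ops && m->a_ops->extent_io_complete)
		m->a_ops->extent_io_complete(m, start, len);
}

/* Toy filesystem: the file is laid out contiguously 1 MiB into the device. */
static int toy_map_extent(struct mapping_sketch *m, uint64_t start,
			  uint64_t len, int create, struct extent_sketch *out)
{
	(void)m;
	(void)create;
	out->start = start;
	out->len = len;
	out->block_start = start + (1ULL << 20);
	return 0;
}

static void toy_io_complete(struct mapping_sketch *m, uint64_t start,
			    uint64_t len)
{
	(void)start;
	(void)len;
	m->completions++;
}
```

A filesystem that leaves map_extent NULL simply keeps going through the
existing page cache path, which is the fallback the TODO item asks for.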

-chris

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -76,6 +76,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -481,16 +482,51 @@ static int do_bio_filebacked(struct loop
return ret;
 }
 
+#define __lo_throttle(wq, lock, condition) \
+do {   \
+   DEFINE_WAIT(__wait);\
+   for (;;) {  \
+   prepare_to_wait((wq), &__wait, TASK_UNINTERRUPTIBLE);   \
+   if (condition)  \
+   break;  \
+   spin_unlock_irq((lock));\
+   io_schedule();  \
+   spin_lock_irq((lock));  \
+   }   \
+   finish_wait((wq), &__wait); \
+} while (0)\
+
+#define lo_act_bio(bio)((bio)->bi_bdev)
+#define LO_BIO_THROTTLE128
+
 /*
- * Add bio to back of pending list
+ * A normal block device will throttle on request allocation. Do the same
+ * for loop to prevent millions of bio's queued internally.
+ */
+static void loop_bio_throttle(struct loop_device *lo, struct bio *bio)
+{
+   if (lo_act_bio(bio))
+   __lo_throttle(&lo->lo_bio_wait, &lo->lo_lock,
+   lo->lo_bio_cnt < LO_BIO_THROTTLE);
+}
+
+/*
+ * Add bio to back of pending list and wakeup thread
  */
 static void loop_add_bio(struct loop_device *lo, struct bio *bio)
 {
+   loop_bio_throttle(lo, bio);
+
if (lo->lo_biotail) {
lo->lo_biotail->bi_next = bio;
lo->lo_biotail = bio;
} else
lo->lo_bio = lo->lo_biotail = bio;
+
+   if (lo_act_bio(bio))
+   lo->lo_bio_cnt++;
+
+   wake_up(&lo->lo_event);
 }
 
 /*
@@ -510,6 +546,178 @@ static struct bio *loop_get_bio(struct l
return bio;
 }
 
+static void loop_exit_fastfs(struct loop_device *lo)
+{
+   /*
+* drop what page cache we instantiated filling holes
+*/
+   invalidate_inode_pages2(lo->lo_backing_file->f_mapping);
+
+   blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_NONE, NULL);
+}
+
+static inline u64 lo_bio_offset(struct loop_device *lo, struct bio *bio)
+{
+   return (u64)lo->lo_offset + ((u64)bio->bi_sector << 9);
+}
+
+/*
+ * Find extent mapping this lo device block to the file block on the real
+ * device
+ */
+static struct extent_map *loop_lookup_extent(struct loop_device *lo,
+u64 offset, gfp_t gfp_mask)
+{
+   struct address_space *mapping;
+   struct extent_map *em;
+   u64 len = 1 << lo->blkbits;
+
+   mapping = lo->lo_backing_file->f_mapping;
+   em = mapping->a_ops->map_extent(mapping, NULL, 0,
+   offset, len, 0, gfp_mask);
+   return em;
+}
+
+/*
+ * Alloc a hint bio to tell the loop thread to read file blocks for a given
+ * range
+ */
+static void loop_schedule_extent_mapping(struct loop_device *lo,
+sector_t sector,
+unsigned long len, int wait)
+{
+   struct bio *bio, stackbio;
+
+   /*
+* it's ok if we occasionally fail. if called with blocking set,
+* then use an on-stack bio since that must not fail.
+*/
+   if (wait) {
+   bio = &stackbio;
+   bi

Re: [PATCH][RFC] fast file mapping for loop

2008-01-11 Thread Jens Axboe

On Fri, Jan 11 2008, Daniel Phillips wrote:
> Hi Jens,
> 
> This looks really useful.
> 
> On Wednesday 09 January 2008 00:52, Jens Axboe wrote:
> > Disadvantages:
> >
> > - The file block mappings must not change while loop is using the
> > file. This means that we have to ensure exclusive access to the file
> > and this is the bit that is currently missing in the implementation.
> > It would be nice if we could just do this via open(), ideas
> > welcome... 
> 
> Get_block methods are pretty fast and you have caching in the level 
> above you, so you might be able to get away with no cache of physical 
> addresses at all, in which case you just need i_mutex and i_alloc_sem 
> at get_block time.  This would save a pile of code and still have the 
> main benefit of avoiding double caching.

I'm not too fond of the tree either, but it serves an important purpose
as well - we need to be careful in calling bmap() so as not to deadlock the
fs under vm pressure. So the current code punts to a thread for bmap()
on extents we don't already have stored in loop. And that slows things
down of course, we would have to still punt every IO to loopd instead of
just doing a quick remap. But...

> If you use ->get_block instead of bmap, it will fill in file holes for 
> you, but of course get_block is not exposed, and Al is likely to bark 
> at anyone who exposes it.
> 
> Instead of exposing get_block you could expose an aops method 
> like ->bio_transfer that would hide the use of *_get_block in a library 
> routine, just as __blockdev_direct_IO does.  Chances are, there are 
> other users besides loop that would be interested in a generic way of 
> performing bio transfers to files.
> 
> I presume you would fall back to the existing approach for any 
> filesystem without get_block.  You could handle this transparently with 
> a default library method that does read/write.

... things are already moving forward, Chris has a new interface for
this and tied it in with the loop fastfs mode. I think he'll post it
later today.
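
The fast-path/slow-path split described above can be sketched in userspace
as follows. This is only an illustration: the lookup is a linear scan over
an array rather than a tree, the punt is just a return code instead of
handing the bio to the loop thread, and all names are hypothetical. The
remap arithmetic mirrors the (disk_block + extent_off) >> 9 computation in
the posted loop patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cached_extent {
	uint64_t start;		/* file offset in bytes */
	uint64_t len;		/* length in bytes */
	uint64_t block_start;	/* byte offset on the backing device */
};

/* Find the cached extent covering file offset off, or NULL on a miss. */
static const struct cached_extent *
lookup_extent(const struct cached_extent *tbl, size_t n, uint64_t off)
{
	for (size_t i = 0; i < n; i++)
		if (off >= tbl[i].start && off - tbl[i].start < tbl[i].len)
			return &tbl[i];
	return NULL;
}

enum remap_result { REMAP_DONE, REMAP_PUNT };

/*
 * Fast path: the extent is already known, so compute the target sector
 * directly. Slow path: report a miss so the caller can punt the request
 * to a thread that is allowed to block in the filesystem.
 */
static enum remap_result
remap_or_punt(const struct cached_extent *tbl, size_t n, uint64_t off,
	      uint64_t *sector_out)
{
	const struct cached_extent *em = lookup_extent(tbl, n, off);

	if (!em)
		return REMAP_PUNT;
	*sector_out = (em->block_start + (off - em->start)) >> 9;
	return REMAP_DONE;
}
```

Without any cache every request would take the REMAP_PUNT branch, which is
the slowdown mentioned above.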

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-11 Thread Daniel Phillips

Hi Jens,

This looks really useful.

On Wednesday 09 January 2008 00:52, Jens Axboe wrote:
> Disadvantages:
>
> - The file block mappings must not change while loop is using the
> file. This means that we have to ensure exclusive access to the file
> and this is the bit that is currently missing in the implementation.
> It would be nice if we could just do this via open(), ideas
> welcome... 

Get_block methods are pretty fast and you have caching in the level 
above you, so you might be able to get away with no cache of physical 
addresses at all, in which case you just need i_mutex and i_alloc_sem 
at get_block time.  This would save a pile of code and still have the 
main benefit of avoiding double caching.

If you use ->get_block instead of bmap, it will fill in file holes for 
you, but of course get_block is not exposed, and Al is likely to bark 
at anyone who exposes it.

Instead of exposing get_block you could expose an aops method 
like ->bio_transfer that would hide the use of *_get_block in a library 
routine, just as __blockdev_direct_IO does.  Chances are, there are 
other users besides loop that would be interested in a generic way of 
performing bio transfers to files.

I presume you would fall back to the existing approach for any 
filesystem without get_block.  You could handle this transparently with 
a default library method that does read/write.

Regards,

Daniel

Re: [PATCH][RFC] fast file mapping for loop

2008-01-11 Thread Chris Mason

On Fri, 11 Jan 2008 10:01:18 +1100
Neil Brown <[EMAIL PROTECTED]> wrote:

> On Thursday January 10, [EMAIL PROTECTED] wrote:
> > On Thu, Jan 10 2008, Chris Mason wrote:
> > > On Thu, 10 Jan 2008 09:31:31 +0100
> > > Jens Axboe <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > > > > Here's the latest version of dm-loop, for comparison.
> > > > > 
> > > > > To try it out, 
> > > > >   ln -s dmsetup dmlosetup
> > > > > and supply similar basic parameters to losetup.
> > > > > (using dmsetup version 1.02.11 or higher)
> > > > 
> > > > > Why oh why does dm always insist on reinventing everything? That's
> > > > bad enough in itself, but on top of that most of the extra
> > > > stuff ends up being essentially unmaintained.
> > > 
> > > I don't quite get how the dm version is reinventing things.  They
> > > use
> > 
> > Things like raid, now file mapping functionality. I'm sure there are
> > more examples, it's how dm was always developed probably originating
> > back to when they developed mostly out of tree. And I think it's a
> > bad idea, we don't want duplicate functionality. If something is
> > wrong with loop, fix it, don't write dm-loop.
> 
> I'm with Jens here.
> 
> We currently have two interfaces that interesting block devices can be
> written for: 'dm' and 'block'.
> We really should aim to have just one.  I would call it 'block' and
> move anything really useful from dm into block.
> 
> As far as I can tell, the important things that 'dm' has that 'block'
> doesn't have are:
> 
>   - a standard ioctl interface for assembling and creating interesting
>  devices.
>  For 'block', everybody just rolls their own. e.g. md, loop, and
>  nbd all use totally different approaches for setup and tear down
>  etc. 
> 
>   - suspend/reconfigure/resume.
> This is something that I would really like to see in 'block'.  If
> I had a filesystem mounted on /dev/sda1 and I wanted to make it a
> raid1, it would be cool if I could
>   suspend /dev/sda1
>   build a raid1 from sda1 and something else
>   plug the raid1 in as 'sda1'.
>   resume sda1
> 
>   - Integrated 'linear' mapping.
> This is the bit of 'dm' that I think of as yucky.  If I read the
> code correctly, every dm device is a linear array of a bunch of
> targets.  Each target can be a stripe-set(raid0) or a multipath or
> a raid1 or a plain block device or whatever.
> Having 'linear' at a different level to everything else seems a
> bit ugly, but it isn't really a big deal.
>

DM is also a framework where you can introduce completely new types of
block devices without having to go through the associated pain of
finding major numbers.  In terms of developing new things with greater
flexibility, I think it is easier.
 
> I would really like to see every 'dm' target being just a regular
> 'block' device.  Then a 'linear' block device could be used to
> assemble dm targets into a dm device.  Or the targets could be used
> directly if the 'linear' function wasn't needed.
> 
> Each target/device could respond to both dm ioctls and 'adhoc'
> ioctls.  That is a bit ugly, but backwards compatibility always is,
> but it isn't a big cost.
> 
> I think the way forward here is to put the important
> suspend/reconfig/resume functionality into the block layer, then
> work on making code work with multiple ioctl interfaces.
> 
> I *don't* think the way forward is to duplicate current block devices
> as dm targets.  This is duplication of effort (which I admit isn't
> always a bad thing) and a maintenance headache (which is).
> 

raid in dm aside (that's an entirely different debate ;), loop is a
pile of things which dm can nicely layer out into pieces (dm-crypt vs
loopback crypt).  Also, dm doesn't have to jump through hoops to get a
variable number of minors.

Yes, the loop side was recently improved for # of minors, and it does
have enough in there for userland to do variable number of minors, but
this is one specific case where dm is just easier.

At any rate, I'm all for ideas that make dm less of the evil stepchild
of the block layer ;)  I'm not saying everything should be dm, but I
did want to point out that dm-loop isn't entirely silly.

I have a version of Jens' patch in testing here that makes a new API
with the FS for mapping extents and hope to post it later today.

-chris

Re: [PATCH][RFC] fast file mapping for loop

2008-01-11 Thread Mikulas Patocka

> So I looked at the code - it seems you build a full extent of the blocks
> in the file, filling holes as you go along. I initially did that as well,
> but that is too slow to be usable in real life.
> 
> You also don't support sparse files, falling back to normal fs
> read/write paths. Supporting sparse files properly is a must, people
> generally don't want to prealloc a huge disk backing.

How would you do sparse file support with passthrough loopback that 
doesn't use pagecache?

Holes are allocated at get_block function provided by each filesystem and 
the function gets a buffer that is supposed to be in the pagecache. Now if 
you want to allocate holes without pagecache, there's a problem --- new 
interface to all filesystems is needed.

It could be possible to use pagecache interface for filling holes and 
passthrough interface for other requests --- but get_block is allowed to 
move other blocks on the filesystem (and on UFS it really does), so 
calling get_block to fill a hole could move other unrelated blocks which 
would result in desynchronized block map and corruption of both 
filesystems.

Mikulas

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Fri, Jan 11 2008, Mikulas Patocka wrote:
> > So I looked at the code - it seems you build a full extent of the blocks
> > in the file, filling holes as you go along. I initially did that as well,
> > but that is too slow to be usable in real life.
> > 
> > You also don't support sparse files, falling back to normal fs
> > read/write paths. Supporting sparse files properly is a must, people
> > generally don't want to prealloc a huge disk backing.
> 
> How would you do sparse file support with passthrough loopback that 
> doesn't use pagecache?
> 
> Holes are allocated by the get_block function provided by each filesystem, and
> the function gets a buffer that is supposed to be in the pagecache. Now if
> you want to allocate holes without pagecache, there's a problem --- a new
> interface to all filesystems is needed.
>
> It could be possible to use the pagecache interface for filling holes and
> the passthrough interface for other requests --- but get_block is allowed to
> move other blocks on the filesystem (and on UFS it really does), so
> calling get_block to fill a hole could move other unrelated blocks, which
> would result in a desynchronized block map and corruption of both
> filesystems.

Please read the posted patch and the posts from Chris as well, it
basically covers everything in your email.

The patch, as posted, doesn't work if the fs moves blocks around.

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Bill Davidsen


Jens Axboe wrote:

> Hi,
>
> loop.c currently uses the page cache interface to do IO to file backed
> devices. This works reasonably well for simple things, like mapping an
> iso9660 file for direct mount and other read-only workloads. Writing is
> somewhat problematic, as anyone who has really used this feature can
> attest to - it tends to confuse the vm (hello kswapd) since it breaks
> dirty accounting and behaves very erratically on writeout. Did I mention
> that it's pretty slow as well, for both reads and writes?

Since you are looking for comments, I'll mention a loop-related behavior 
I've been seeing and see if it gets comments or is useful, since it can 
be used to tickle bad behavior on demand.


I have a 6GB sparse file, which I mount with cryptoloop and populate as 
an ext3 filesystem (more later on why). I then copy ~5.8GB of data to 
the filesystem, which is unmounted to be burnt to a DVD. Before it's 
burned the "dvdisaster" application is used to add some ECC information 
to the end, and make an image which fits on a DVD-DL. Media will be 
burned and distributed to multiple locations.


The problem:

When copying with rsync, the copy runs at ~25MB/s for a while, then 
falls into a pattern of bursts of 25MB/s followed by 10-15 sec of iowait 
with no disk activity. So I tried doing the copy with cpio

  find . -depth | cpio -pdm /mnt/loop

which shows exactly the same behavior. Then, for no good reason, I tried

  find . -depth | cpio -pBdm /mnt/loop

and the copy ran at 25MB/s for the whole data set.

I was able to see similar results with a pure loop mount; I only mention
the crypto for accuracy. Many of these have been shipped over the last
two years, so new loop code would only be useful in this case if it were
compatible, so that old data sets could still be read.



> It also behaves differently than a real drive. For writes, completions
> are done once they hit page cache. Since loop queues bios async and
> hands them off to a thread, you can have a huge backlog of stuff to do.
> It's hard to attempt to guarantee data safety for file systems on top of
> loop without making it even slower than it currently is.
>
> Back when loop was only used for iso9660 mounting and other simple
> things, this didn't matter. Now it's often used in xen (and others)
> setups where we do care about performance AND writing. So the below is an
> attempt at speeding up loop and making it behave like a real device.
> It's a somewhat quick hack and is still missing one piece to be
> complete, but I'll throw it out there for people to play with and
> comment on.
>
> So how does it work? Instead of punting IO to a thread and passing it
> through the page cache, we instead attempt to send the IO directly to the
> filesystem block that it maps to. loop maintains a prio tree of known
> extents in the file (populated lazily on demand, as needed). Advantages
> of this approach:
>
> - It's fast, loop will basically work at device speed.
> - It's fast, and loop doesn't put a huge amount of load on the
>   system when busy. When I did comparison tests on my notebook with an
>   external drive, running a simple tiobench on the current in-kernel
>   loop with a sparse file backing rendered the notebook basically
>   unusable while the test was ongoing. The remapper version had no more
>   impact than it did when used directly on the external drive.
> - It behaves like a real block device.
> - It's easy to support IO barriers, which is needed to ensure safety
>   especially in virtualized setups.
>
> Disadvantages:
>
> - The file block mappings must not change while loop is using the file.
>   This means that we have to ensure exclusive access to the file and
>   this is the bit that is currently missing in the implementation. It
>   would be nice if we could just do this via open(), ideas welcome...
> - It'll tie down a bit of memory for the prio tree. This is GREATLY
>   offset by the reduced page cache footprint though.
> - It cannot be used with the loop encryption stuff. dm-crypt should be
>   used instead, on top of loop (which, I think, is even the recommended
>   way to do this today, so not a big deal).
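
For readers following along, the lazily populated extent lookup described
above might look roughly like the user-space sketch below. It is an
illustration only, not the posted patch: it uses a small sorted array where
the patch uses a prio tree, and map_block(), demo_map_block(), and all the
struct names are hypothetical stand-ins (map_block() plays the role of
bmap()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One cached mapping: file blocks [file_start, file_start + len) live at
 * disk blocks [disk_start, disk_start + len). */
struct extent {
    uint64_t file_start;
    uint64_t disk_start;
    uint64_t len;
};

#define MAX_EXTENTS 64

struct extent_cache {
    struct extent ext[MAX_EXTENTS];             /* sorted by file_start */
    size_t nr;
    uint64_t (*map_block)(uint64_t file_block); /* hypothetical bmap() stand-in */
    unsigned long misses;
};

/* Made-up layout for the demo: blocks 0-7 at 1000+, the rest at 5000+. */
uint64_t demo_map_block(uint64_t fb)
{
    return fb < 8 ? 1000 + fb : 5000 + fb;
}

/* Binary search for the extent covering file_block; NULL if not cached. */
static struct extent *cache_find(struct extent_cache *c, uint64_t file_block)
{
    size_t lo = 0, hi = c->nr;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        struct extent *e = &c->ext[mid];

        if (file_block < e->file_start)
            hi = mid;
        else if (file_block >= e->file_start + e->len)
            lo = mid + 1;
        else
            return e;
    }
    return NULL;
}

/* Remember a freshly looked-up block, growing the previous extent if the
 * new block is contiguous with it both in the file and on disk. */
static void cache_insert(struct extent_cache *c, uint64_t fb, uint64_t db)
{
    size_t pos = 0;

    while (pos < c->nr && c->ext[pos].file_start < fb)
        pos++;
    if (pos > 0) {
        struct extent *p = &c->ext[pos - 1];

        if (p->file_start + p->len == fb && p->disk_start + p->len == db) {
            p->len++;
            return;
        }
    }
    if (c->nr == MAX_EXTENTS)
        return;     /* cache full: the mapping is still returned, just not kept */
    memmove(&c->ext[pos + 1], &c->ext[pos],
            (c->nr - pos) * sizeof(c->ext[0]));
    c->ext[pos] = (struct extent){ fb, db, 1 };
    c->nr++;
}

/* Translate a file block to a disk block, populating the cache on a miss. */
uint64_t cache_map(struct extent_cache *c, uint64_t fb)
{
    struct extent *e;
    uint64_t db;

    assert(c && c->map_block);
    e = cache_find(c, fb);
    if (e)
        return e->disk_start + (fb - e->file_start);
    c->misses++;
    db = c->map_block(fb);
    cache_insert(c, fb, db);
    return db;
}
```

Only the first access to a block pays for the map_block() call; later
accesses are answered from the cache, which is the trade-off described
above (a bit of memory tied down in exchange for skipping the page cache).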



--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Neil Brown

On Thursday January 10, [EMAIL PROTECTED] wrote:
> On Thu, Jan 10 2008, Chris Mason wrote:
> > On Thu, 10 Jan 2008 09:31:31 +0100
> > Jens Axboe <[EMAIL PROTECTED]> wrote:
> > 
> > > On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > > > Here's the latest version of dm-loop, for comparison.
> > > > 
> > > > To try it out, 
> > > >   ln -s dmsetup dmlosetup
> > > > and supply similar basic parameters to losetup.
> > > > (using dmsetup version 1.02.11 or higher)
> > > 
> > > > Why oh why does dm always insist on reinventing everything? That's bad
> > > enough in itself, but on top of that most of the extra stuff ends up
> > > being essentially unmaintained.
> > 
> > I don't quite get how the dm version is reinventing things.  They use
> 
> Things like raid, now file mapping functionality. I'm sure there are
> more examples, it's how dm was always developed probably originating
> back to when they developed mostly out of tree. And I think it's a bad
> idea, we don't want duplicate functionality. If something is wrong with
> loop, fix it, don't write dm-loop.

I'm with Jens here.

We currently have two interfaces that interesting block devices can be
written for: 'dm' and 'block'.
We really should aim to have just one.  I would call it 'block' and
move anything really useful from dm into block.

As far as I can tell, the important things that 'dm' has that 'block'
doesn't have are:

  - a standard ioctl interface for assembling and creating interesting
    devices.
    For 'block', everybody just rolls their own, e.g. md, loop, and
    nbd all use totally different approaches for setup and tear down
    etc.

  - suspend/reconfigure/resume.
    This is something that I would really like to see in 'block'.  If
    I had a filesystem mounted on /dev/sda1 and I wanted to make it a
    raid1, it would be cool if I could
        suspend /dev/sda1
        build a raid1 from sda1 and something else
        plug the raid1 in as 'sda1'.
        resume sda1

  - Integrated 'linear' mapping.
    This is the bit of 'dm' that I think of as yucky.  If I read the
    code correctly, every dm device is a linear array of a bunch of
    targets.  Each target can be a stripe-set(raid0) or a multipath or
    a raid1 or a plain block device or whatever.
    Having 'linear' at a different level to everything else seems a
    bit ugly, but it isn't really a big deal.

I would really like to see every 'dm' target being just a regular
'block' device.  Then a 'linear' block device could be used to
assemble dm targets into a dm device.  Or the targets could be used
directly if the 'linear' function wasn't needed.

Each target/device could respond to both dm ioctls and 'adhoc'
ioctls.  That is a bit ugly, but backwards compatibility always is,
but it isn't a big cost.

I think the way forward here is to put the important
suspend/reconfig/resume functionality into the block layer, then
work on making code work with multiple ioctl interfaces.

I *don't* think the way forward is to duplicate current block devices
as dm targets.  This is duplication of effort (which I admit isn't
always a bad thing) and a maintenance headache (which is).

 "Help, I'm having a problem with the loop driver"
 "Which one, dm or regular, because if it is the 'dm' one you have to
  ask over there ..."

 It has already happened occasionally with raid1 and multipath - both
 of which have pointless multiple implementations.

> 
> > the dmsetup command that they use for everything else and provide a
> > small and fairly clean module for bio specific loop instead of piling
> > it onto loop.c

I'm missing something here... "fairly clean module for bio specific
loop".  Isn't that what 'loop.c' is meant to be?

NeilBrown

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Chris Mason

On Thu, 10 Jan 2008 14:03:24 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Thu, Jan 10 2008, Chris Mason wrote:
> > On Thu, 10 Jan 2008 08:54:59 +
> > Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > 
> > > On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > > > IMHO this shouldn't be done in the loop driver anyway.
> > > > > Filesystems have their own efficient extent lookup trees
> > > > > (well, at least xfs and btrfs do), and we should leverage
> > > > > that instead of reinventing it.
> > > > 
> > > > Completely agree, it's just needed right now for this solution
> > > > since all we have is a crappy bmap() interface to get at those
> > > > mappings.
> > > 
> > > So let's fix the interface instead of piling crap on top of it.
> > > As I said I think Peter has something to start with so let's beat
> > > on it until we have something suitable.   If we aren't done by
> > > end of Feb I'm happy to host a hackfest to get it sorted around
> > > the fs/storage summit..
> > > 
> > 
> > Ok, I've been meaning to break my extent_map code up, and this is a
> > very good reason.  I'll work up a sample today based on Jens' code.
> 
> Great!

Grin, we'll see how the sample looks.

> 
> > The basic goals:
> > 
> > * Loop (swap) calls into the FS for each mapping. Any caching
> > happens on the FS side.
> > * The FS returns an extent, filling any holes
> 
> We don't want to fill holes for a read, but I guess that's a given?

Right.
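
A rough user-space sketch of the "FS returns an extent" call with the
read/write distinction agreed on above. Everything here is hypothetical
(toy_file, toy_map_extent(), the create flag, and using 0 to mark a hole);
it only illustrates that a read reports a hole as a hole while a write
fills it.

```c
#include <assert.h>
#include <stdint.h>

#define NR_BLOCKS 8
#define UNMAPPED  0     /* assumption for this sketch: 0 marks a hole */

/* A toy "filesystem" file: one block map plus a trivial allocator. */
struct toy_file {
    uint64_t block_map[NR_BLOCKS];  /* file block -> disk block */
    uint64_t next_free;             /* next disk block to hand out */
};

struct extent {
    uint64_t file_start;
    uint64_t disk_start;
    uint64_t len;
    int hole;                       /* 1 if the range is unallocated */
};

/* Return the extent starting at 'start', covering at most 'max' blocks.
 * create == 0 (read): holes are reported, never filled.
 * create != 0 (write): a hole is filled with newly allocated blocks. */
struct extent toy_map_extent(struct toy_file *f, uint64_t start,
                             uint64_t max, int create)
{
    struct extent e;

    assert(f && start < NR_BLOCKS && max > 0);
    if (max > NR_BLOCKS - start)
        max = NR_BLOCKS - start;

    e.file_start = start;
    e.disk_start = f->block_map[start];
    e.len = 1;
    e.hole = 0;

    if (e.disk_start == UNMAPPED) {
        if (!create) {
            /* Read path: just measure the hole. */
            e.hole = 1;
            while (e.len < max && f->block_map[start + e.len] == UNMAPPED)
                e.len++;
            return e;
        }
        /* Write path: allocate consecutive disk blocks for the hole. */
        e.disk_start = f->next_free;
        e.len = 0;
        while (e.len < max && f->block_map[start + e.len] == UNMAPPED) {
            f->block_map[start + e.len] = f->next_free++;
            e.len++;
        }
        return e;
    }

    /* Already mapped: extend while the blocks stay contiguous on disk. */
    while (e.len < max &&
           f->block_map[start + e.len] == e.disk_start + e.len)
        e.len++;
    return e;
}
```

With this shape any caching stays on the filesystem side, and the caller
(loop or swap) only ever sees extents.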

> 
> > Swap would need to use an extra call early on for preallocation.
> > 
> > Step two is having a call back into the FS allow the FS to delay the
> > bios until commit completion so that COW and delalloc blocks can be
> > fully on disk when the bios are reported as done.  Jens, can you add
> > some way to queue the bio completions up?
> 
> Sure, a function to save a completed bio and a function to execute
> completions on those already stored?
> 

Sounds right, I'm mostly looking for a way to aggregate a few writes to
make the commits a little larger.
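
The save/execute pair discussed here could be sketched along these lines.
This is a user-space illustration under assumed names (done_io stands in
for a completed bio; cq_save(), cq_run(), and demo_end_io() are made up),
not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio whose IO has finished. */
struct done_io {
    int id;
    void (*end_io)(struct done_io *io);     /* final completion callback */
    struct done_io *next;
};

struct completion_queue {
    struct done_io *head, *tail;
    size_t nr;
};

/* Demo callback: records which IOs were reported as done. */
int demo_completed;
void demo_end_io(struct done_io *io)
{
    demo_completed += io->id;
}

/* Save a completed IO instead of reporting it as done right away. */
void cq_save(struct completion_queue *q, struct done_io *io)
{
    assert(q && io && io->end_io);
    io->next = NULL;
    if (q->tail)
        q->tail->next = io;
    else
        q->head = io;
    q->tail = io;
    q->nr++;
}

/* Execute all saved completions in the order they were saved, e.g. once
 * the filesystem commit covering them is on disk. Returns how many ran. */
size_t cq_run(struct completion_queue *q)
{
    struct done_io *io = q->head;
    size_t ran = 0;

    q->head = q->tail = NULL;
    q->nr = 0;
    while (io) {
        struct done_io *next = io->next;    /* end_io may reuse or free io */

        io->end_io(io);
        io = next;
        ran++;
    }
    return ran;
}
```

Batching several writes behind one cq_run() call is what would let the
filesystem make each commit a little larger.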

-chris

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Chris Mason wrote:
> On Thu, 10 Jan 2008 08:54:59 +
> Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > > IMHO this shouldn't be done in the loop driver anyway.
> > > > Filesystems have their own efficient extent lookup trees (well,
> > > > at least xfs and btrfs do), and we should leverage that instead
> > > > of reinventing it.
> > > 
> > > Completely agree, it's just needed right now for this solution
> > > since all we have is a crappy bmap() interface to get at those
> > > mappings.
> > 
> > So let's fix the interface instead of piling crap on top of it.  As I
> > said I think Peter has something to start with so let's beat on it
> > until we have something suitable.   If we aren't done by end of Feb
> > I'm happy to host a hackfest to get it sorted around the fs/storage
> > summit..
> > 
> 
> Ok, I've been meaning to break my extent_map code up, and this is a
> very good reason.  I'll work up a sample today based on Jens' code.

Great!

> The basic goals:
> 
> * Loop (swap) calls into the FS for each mapping. Any caching happens
> on the FS side.
> * The FS returns an extent, filling any holes

We don't want to fill holes for a read, but I guess that's a given?

> Swap would need to use an extra call early on for preallocation.
> 
> Step two is having a call back into the FS allow the FS to delay the
> bios until commit completion so that COW and delalloc blocks can be
> fully on disk when the bios are reported as done.  Jens, can you add
> some way to queue the bio completions up?

Sure, a function to save a completed bio and a function to execute
completions on those already stored?

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Chris Mason wrote:
> On Thu, 10 Jan 2008 09:31:31 +0100
> Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > > Here's the latest version of dm-loop, for comparison.
> > > 
> > > To try it out, 
> > >   ln -s dmsetup dmlosetup
> > > and supply similar basic parameters to losetup.
> > > (using dmsetup version 1.02.11 or higher)
> > 
> > Why oh why does dm always insist on reinventing everything? That's bad
> > enough in itself, but on top of that most of the extra stuff ends up
> > being essentially unmaintained.
> 
> I don't quite get how the dm version is reinventing things.  They use

Things like raid, now file mapping functionality. I'm sure there are
more examples, it's how dm was always developed probably originating
back to when they developed mostly out of tree. And I think it's a bad
idea, we don't want duplicate functionality. If something is wrong with
loop, fix it, don't write dm-loop.

> the dmsetup command that they use for everything else and provide a
> small and fairly clean module for bio specific loop instead of piling
> it onto loop.c

If loop.c is a problem, I'd rather see a newloop.c (with a better name,
of course) that we can transition to.

> Their code doesn't have the fancy hole handling that yours does, but
> neither did yours 4 days ago ;)

Well mine didn't exist 4 days ago, I was just listing missing
functionality.

> 
> > 
> > If we instead improve loop, everyone wins.
> > 
> > Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a
> > bit outside your own sandbox.
> > 
> 
> It is a natural fit in either place, as both loop and dm have a good
> infrastructure for it.  I'm not picky about where it ends up, but dm
> wouldn't be a bad place.

I know that's your opinion, I reserve the right to have my own on where
this functionality belongs :)

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Chris Mason

On Thu, 10 Jan 2008 08:54:59 +
Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > IMHO this shouldn't be done in the loop driver anyway.
> > > Filesystems have their own efficient extent lookup trees (well,
> > > at least xfs and btrfs do), and we should leverage that instead
> > > of reinventing it.
> > 
> > Completely agree, it's just needed right now for this solution
> > since all we have is a crappy bmap() interface to get at those
> > mappings.
> 
> So let's fix the interface instead of piling crap on top of it.  As I
> said I think Peter has something to start with so let's beat on it
> until we have something suitable.   If we aren't done by end of Feb
> I'm happy to host a hackfest to get it sorted around the fs/storage
> summit..
> 

Ok, I've been meaning to break my extent_map code up, and this is a
very good reason.  I'll work up a sample today based on Jens' code.

The basic goals:

* Loop (swap) calls into the FS for each mapping. Any caching happens
on the FS side.
* The FS returns an extent, filling any holes

Swap would need to use an extra call early on for preallocation.

Step two is having a call back into the FS allow the FS to delay the
bios until commit completion so that COW and delalloc blocks can be
fully on disk when the bios are reported as done.  Jens, can you add
some way to queue the bio completions up?

-chris

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Chris Mason

On Thu, 10 Jan 2008 09:31:31 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > Here's the latest version of dm-loop, for comparison.
> > 
> > To try it out, 
> >   ln -s dmsetup dmlosetup
> > and supply similar basic parameters to losetup.
> > (using dmsetup version 1.02.11 or higher)
> 
> Why oh why does dm always insist on reinventing everything? That's bad
> enough in itself, but on top of that most of the extra stuff ends up
> being essentially unmaintained.

I don't quite get how the dm version is reinventing things.  They use
the dmsetup command that they use for everything else and provide a
small and fairly clean module for bio specific loop instead of piling
it onto loop.c

Their code doesn't have the fancy hole handling that yours does, but
neither did yours 4 days ago ;)

> 
> If we instead improve loop, everyone wins.
> 
> Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a
> bit outside your own sandbox.
> 

It is a natural fit in either place, as both loop and dm have a good
infrastructure for it.  I'm not picky about where it ends up, but dm
wouldn't be a bad place.

-chris

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Peter Zijlstra


On Thu, 2008-01-10 at 11:02 +0100, Jens Axboe wrote:
> On Thu, Jan 10 2008, Peter Zijlstra wrote:
> > 
> > On Thu, 2008-01-10 at 10:49 +0100, Jens Axboe wrote:
> > > On Thu, Jan 10 2008, Peter Zijlstra wrote:
> > > > 
> > > > On Thu, 2008-01-10 at 08:37 +, Christoph Hellwig wrote:
> > > > 
> > > > > Peter, any chance you could chime in here?
> > > > 
> > > > I have this patch to add swap_out/_in methods. I expect we can loosen
> > > > the requirement for swapcache pages and change the name a little.
> > > > 
> > > > previously posted here:
> > > >   http://lkml.org/lkml/2007/5/4/143
> > > > 
> > > > --- 
> > > > Subject: mm: add support for non block device backed swap files
> > > > 
> > > > New address_space_operations methods are added:
> > > >   int swapfile(struct address_space *, int);
> > > >   int swap_out(struct file *, struct page *, struct writeback_control *);
> > > >   int swap_in(struct file *, struct page *);
> > > >
> > > > When during sys_swapon() the swapfile() method is found and returns no error
> > > > the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> > > > make use of swap_{out,in}() to write/read swapcache pages.
> > > >
> > > > The swapfile method will be used to communicate to the address_space that the
> > > > VM relies on it, and the address_space should take adequate measures (like
> > > > reserving memory for mempools or the like).
> > > >
> > > > This new interface can be used to obviate the need for ->bmap in the swapfile
> > > > code. A filesystem would need to load (and maybe even allocate) the full block
> > > > map for a file into memory and pin it there on ->swapfile(,1) so that
> > > > ->swap_{out,in}() have instant access to it. It can be released on
> > > > ->swapfile(,0).
> > > 
> > > So this is where I don't think that's good enough, you cannot require a
> > > full block/extent mapping of a file on setup. It can take quite some
> > > time, a little testing I did here easily took 5 seconds for only a
> > > couple of gigabytes. And that wasn't even worst case for that size. It
> > > also wastes memory by populating extents that we may never read or
> > > write.
> > > 
> > > If you look at the loop addition I did, it populates lazily as needed
> > > with some very simple logic to populate-ahead. In practice that performs
> > > as well as a pre-populated map, the first IO to a given range will just
> > > be a little slower since we have to bmap() it.
> > > 
> > > Do you have plans to improve this area?
> > 
> > Nope, for swap it _must_ be there, there is just no way we can do block
> > allocation on swapout.
> 
> I appreciate that fact, just saying that we could have more flexibility
> for other uses.
> 
> > That said, the swapfile() interface can be used to pre-populate the
> > extent/block mapping, and when using swap_(in/out) without it, it can be
> > done lazily.
> 
> I think the interface is too simple, I would much rather have a way to
> dig into the file mappings and be allowed to request population and so
> on. Without that, it can't be used for eg loop.

I'm open to suggestions :-), if you can propose an interface we both can
use I'm all ears.

Although poking at the extents sounds like bmap and we were just trying
to get rid of that.

Perhaps something like badvise(), where you can suggest loading/dropping
extent information on a voluntary basis.



Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Peter Zijlstra wrote:
> 
> On Thu, 2008-01-10 at 10:49 +0100, Jens Axboe wrote:
> > On Thu, Jan 10 2008, Peter Zijlstra wrote:
> > > 
> > > On Thu, 2008-01-10 at 08:37 +, Christoph Hellwig wrote:
> > > 
> > > > Peter, any chance you could chime in here?
> > > 
> > > I have this patch to add swap_out/_in methods. I expect we can loosen
> > > the requirement for swapcache pages and change the name a little.
> > > 
> > > previously posted here:
> > >   http://lkml.org/lkml/2007/5/4/143
> > > 
> > > --- 
> > > Subject: mm: add support for non block device backed swap files
> > > 
> > > New address_space_operations methods are added:
> > >   int swapfile(struct address_space *, int);
> > >   int swap_out(struct file *, struct page *, struct writeback_control *);
> > >   int swap_in(struct file *, struct page *);
> > >
> > > When during sys_swapon() the swapfile() method is found and returns no error
> > > the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> > > make use of swap_{out,in}() to write/read swapcache pages.
> > >
> > > The swapfile method will be used to communicate to the address_space that the
> > > VM relies on it, and the address_space should take adequate measures (like
> > > reserving memory for mempools or the like).
> > >
> > > This new interface can be used to obviate the need for ->bmap in the swapfile
> > > code. A filesystem would need to load (and maybe even allocate) the full block
> > > map for a file into memory and pin it there on ->swapfile(,1) so that
> > > ->swap_{out,in}() have instant access to it. It can be released on
> > > ->swapfile(,0).
> > 
> > So this is where I don't think that's good enough, you cannot require a
> > full block/extent mapping of a file on setup. It can take quite some
> > time, a little testing I did here easily took 5 seconds for only a
> > couple of gigabytes. And that wasn't even worst case for that size. It
> > also wastes memory by populating extents that we may never read or
> > write.
> > 
> > If you look at the loop addition I did, it populates lazily as needed
> > with some very simple logic to populate-ahead. In practice that performs
> > as well as a pre-populated map, the first IO to a given range will just
> > be a little slower since we have to bmap() it.
> > 
> > Do you have plans to improve this area?
> 
> Nope, for swap it _must_ be there, there is just no way we can do block
> allocation on swapout.

I appreciate that fact, just saying that we could have more flexibility
for other uses.

> That said, the swapfile() interface can be used to pre-populate the
> extent/block mapping, and when using swap_(in/out) without it, it can be
> done lazily.

I think the interface is too simple, I would much rather have a way to
dig into the file mappings and be allowed to request population and so
on. Without that, it can't be used for eg loop.

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Peter Zijlstra


On Thu, 2008-01-10 at 10:49 +0100, Jens Axboe wrote:
> On Thu, Jan 10 2008, Peter Zijlstra wrote:
> > 
> > On Thu, 2008-01-10 at 08:37 +, Christoph Hellwig wrote:
> > 
> > > Peter, any chance you could chime in here?
> > 
> > I have this patch to add swap_out/_in methods. I expect we can loosen
> > the requirement for swapcache pages and change the name a little.
> > 
> > previously posted here:
> >   http://lkml.org/lkml/2007/5/4/143
> > 
> > --- 
> > Subject: mm: add support for non block device backed swap files
> > 
> > New address_space_operations methods are added:
> >   int swapfile(struct address_space *, int);
> >   int swap_out(struct file *, struct page *, struct writeback_control *);
> >   int swap_in(struct file *, struct page *);
> >
> > When during sys_swapon() the swapfile() method is found and returns no error
> > the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> > make use of swap_{out,in}() to write/read swapcache pages.
> >
> > The swapfile method will be used to communicate to the address_space that the
> > VM relies on it, and the address_space should take adequate measures (like
> > reserving memory for mempools or the like).
> >
> > This new interface can be used to obviate the need for ->bmap in the swapfile
> > code. A filesystem would need to load (and maybe even allocate) the full block
> > map for a file into memory and pin it there on ->swapfile(,1) so that
> > ->swap_{out,in}() have instant access to it. It can be released on
> > ->swapfile(,0).
> 
> So this is where I don't think that's good enough, you cannot require a
> full block/extent mapping of a file on setup. It can take quite some
> time, a little testing I did here easily took 5 seconds for only a
> couple of gigabytes. And that wasn't even worst case for that size. It
> also wastes memory by populating extents that we may never read or
> write.
> 
> If you look at the loop addition I did, it populates lazily as needed
> with some very simple logic to populate-ahead. In practice that performs
> as well as a pre-populated map, the first IO to a given range will just
> be a little slower since we have to bmap() it.
> 
> Do you have plans to improve this area?

Nope, for swap it _must_ be there, there is just no way we can do block
allocation on swapout.

That said, the swapfile() interface can be used to pre-populate the
extent/block mapping, and when using swap_(in/out) without it, it can be
done lazily.



Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Peter Zijlstra wrote:
> 
> On Thu, 2008-01-10 at 08:37 +, Christoph Hellwig wrote:
> 
> > Peter, any chance you could chime in here?
> 
> I have this patch to add swap_out/_in methods. I expect we can loosen
> the requirement for swapcache pages and change the name a little.
> 
> previously posted here:
>   http://lkml.org/lkml/2007/5/4/143
> 
> --- 
> Subject: mm: add support for non block device backed swap files
> 
> New address_space_operations methods are added:
>   int swapfile(struct address_space *, int);
>   int swap_out(struct file *, struct page *, struct writeback_control *);
>   int swap_in(struct file *, struct page *);
> 
> When during sys_swapon() the swapfile() method is found and returns no error
> the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
> make use of swap_{out,in}() to write/read swapcache pages.
> 
> The swapfile method will be used to communicate to the address_space that the
> VM relies on it, and the address_space should take adequate measures (like 
> reserving memory for mempools or the like).
> 
> This new interface can be used to obviate the need for ->bmap in the swapfile
> code. A filesystem would need to load (and maybe even allocate) the full block
> map for a file into memory and pin it there on ->swapfile(,1) so that
> ->swap_{out,in}() have instant access to it. It can be released on
> ->swapfile(,0).

So this is where I don't think that's good enough, you cannot require a
full block/extent mapping of a file on setup. It can take quite some
time, a little testing I did here easily took 5 seconds for only a
couple of gigabytes. And that wasn't even worst case for that size. It
also wastes memory by populating extents that we may never read or
write.

If you look at the loop addition I did, it populates lazily as needed
with some very simple logic to populate-ahead. In practice that performs
as well as a pre-populated map, the first IO to a given range will just
be a little slower since we have to bmap() it.
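
A minimal user-space sketch of what "lazy with simple populate-ahead"
could mean, for illustration only. The names (populate_ahead(),
map_block_fn, demo_map_block(), POPULATE_AHEAD) and the limit of 16 blocks
are assumptions, not taken from the loop patch; the callback stands in for
bmap(), with 0 treated as a hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HOLE_BLOCK     0    /* assumption: 0 means "no block mapped" */
#define POPULATE_AHEAD 16   /* assumed cap on blocks mapped per miss */

struct extent {
    uint64_t file_start;
    uint64_t disk_start;    /* HOLE_BLOCK for an unallocated range */
    uint64_t len;
};

/* Hypothetical stand-in for bmap(): file block -> disk block. */
typedef uint64_t (*map_block_fn)(uint64_t file_block);

/* Made-up layout for the demo: 0-3 mapped, 4-5 a hole, 6 and up mapped. */
uint64_t demo_map_block(uint64_t fb)
{
    if (fb < 4)
        return 100 + fb;
    if (fb < 6)
        return HOLE_BLOCK;
    return 200 + (fb - 6);
}

/* On a cache miss for block fb: map it, then keep mapping the following
 * blocks for as long as they continue the same extent (contiguous on
 * disk, or part of the same hole), up to POPULATE_AHEAD blocks. */
struct extent populate_ahead(map_block_fn map_block, uint64_t fb)
{
    struct extent e;

    assert(map_block != NULL);
    e.file_start = fb;
    e.disk_start = map_block(fb);
    e.len = 1;

    while (e.len < POPULATE_AHEAD) {
        uint64_t next = map_block(fb + e.len);
        uint64_t want = e.disk_start == HOLE_BLOCK
                        ? HOLE_BLOCK
                        : e.disk_start + e.len;

        if (next != want)
            break;
        e.len++;
    }
    return e;
}
```

The first IO to a range pays for up to POPULATE_AHEAD lookups; later IO
to the same range is served from the resulting extent without another
lookup.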

Do you have plans to improve this area?

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Peter Zijlstra


On Thu, 2008-01-10 at 08:37 +, Christoph Hellwig wrote:

> Peter, any chance you could chime in here?

I have this patch to add swap_out/_in methods. I expect we can loosen
the requirement for swapcache pages and change the name a little.

previously posted here:
  http://lkml.org/lkml/2007/5/4/143

--- 
Subject: mm: add support for non block device backed swap files

New address_space_operations methods are added:
  int swapfile(struct address_space *, int);
  int swap_out(struct file *, struct page *, struct writeback_control *);
  int swap_in(struct file *, struct page *);

When during sys_swapon() the swapfile() method is found and returns no error
the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
make use of swap_{out,in}() to write/read swapcache pages.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like 
reserving memory for mempools or the like).

This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapfile(,1) so that
->swap_{out,in}() have instant access to it. It can be released on
->swapfile(,0).

The reason to provide ->swap_{out,in}() over using {write,read}page() is to
 1) make a distinction between swapcache and pagecache pages, and
 2) provide a struct file * for credential context (normally not needed
    in the context of writepage, as the page content is normally dirtied
    using one of the following interfaces:
      write_{begin,end}()
      {prepare,commit}_write()
      page_mkwrite()
    which do have the file context).
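
The dispatch described in this changelog can be modelled in a few lines of userspace C. This is a rough sketch, not the patch itself: every `toy_` name is hypothetical, the method signatures are simplified (the real swapfile() takes a struct address_space *, and swap_out() also takes a writeback_control), and the block-device path is reduced to a counter.

```c
#include <stddef.h>

struct toy_file;
struct toy_page { int id; };

/* Toy version of the three new methods; signatures simplified. */
struct toy_aops {
	int (*swapfile)(struct toy_file *f, int enable);
	int (*swap_out)(struct toy_file *f, struct toy_page *p);
	int (*swap_in)(struct toy_file *f, struct toy_page *p);
};

struct toy_file {
	const struct toy_aops *a_ops;
	int pinned;   /* set while the file is in use as swap */
	int writes;
	int reads;
};

enum { TOY_SWP_FILE = 1 << 2 };   /* mirrors SWP_FILE in the patch */

struct toy_swap_info {
	unsigned flags;
	struct toy_file *swap_file;
	int bio_writes;   /* pages that took the block-device path */
};

/* swapon: use the file-backed path only if ->swapfile() exists and succeeds. */
static void toy_swapon(struct toy_swap_info *sis, struct toy_file *f)
{
	sis->swap_file = f;
	sis->flags = 0;
	sis->bio_writes = 0;
	if (f->a_ops->swapfile && f->a_ops->swapfile(f, 1) == 0)
		sis->flags |= TOY_SWP_FILE;
}

/* swapoff: tell the address space it may release what it pinned. */
static void toy_swapoff(struct toy_swap_info *sis)
{
	if (sis->flags & TOY_SWP_FILE)
		sis->swap_file->a_ops->swapfile(sis->swap_file, 0);
	sis->flags = 0;
}

/* Write one swapcache page: proxy to ->swap_out() or fall back to a bio. */
static int toy_swap_writepage(struct toy_swap_info *sis, struct toy_page *p)
{
	if (sis->flags & TOY_SWP_FILE)
		return sis->swap_file->a_ops->swap_out(sis->swap_file, p);
	sis->bio_writes++;
	return 0;
}

/* A toy filesystem that implements the methods... */
static int fs_swapfile(struct toy_file *f, int enable) { f->pinned = enable; return 0; }
static int fs_swap_out(struct toy_file *f, struct toy_page *p) { (void)p; f->writes++; return 0; }
static int fs_swap_in(struct toy_file *f, struct toy_page *p) { (void)p; f->reads++; return 0; }
static const struct toy_aops fs_aops = { fs_swapfile, fs_swap_out, fs_swap_in };

/* ...and one that does not, so swap stays on the block path. */
static const struct toy_aops plain_aops = { NULL, NULL, NULL };
```

The point of the flag is that the decision is made once at swapon time; the per-page path only tests a bit.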

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 Documentation/filesystems/Locking |   19 
 Documentation/filesystems/vfs.txt |   17 +++
 include/linux/buffer_head.h   |2 -
 include/linux/fs.h|8 +
 include/linux/swap.h  |3 +
 mm/Kconfig|3 +
 mm/page_io.c  |   58 ++
 mm/swap_state.c   |5 +++
 mm/swapfile.c |   22 +-
 9 files changed, 135 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/swap.h
===
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -164,6 +164,7 @@ enum {
    SWP_USED    = (1 << 0), /* is slot in swap_info[] used? */
    SWP_WRITEOK = (1 << 1), /* ok to write to this swap?    */
    SWP_ACTIVE  = (SWP_USED | SWP_WRITEOK),
+   SWP_FILE    = (1 << 2), /* file swap area */
    /* add others here before... */
    SWP_SCANNING    = (1 << 8), /* refcount in scan_swap_map */
 };
@@ -261,6 +262,8 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern void end_swap_bio_read(struct bio *bio, int err);
 
 /* linux/mm/swap_state.c */
Index: linux-2.6/mm/page_io.c
===
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -102,6 +103,18 @@ int swap_writepage(struct page *page, st
unlock_page(page);
goto out;
}
+#ifdef CONFIG_SWAP_FILE
+   {
+   struct swap_info_struct *sis = page_swap_info(page);
+   if (sis->flags & SWP_FILE) {
+   ret = sis->swap_file->f_mapping->
+   a_ops->swap_out(sis->swap_file, page, wbc);
+   if (!ret)
+   count_vm_event(PSWPOUT);
+   return ret;
+   }
+   }
+#endif
bio = get_swap_bio(GFP_NOIO, page_private(page), page,
end_swap_bio_write);
if (bio == NULL) {
@@ -120,6 +133,39 @@ out:
return ret;
 }
 
+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+   struct swap_info_struct *sis = page_swap_info(page);
+
+   if (sis->flags & SWP_FILE) {
+   const struct address_space_operations * a_ops =
+   sis->swap_file->f_mapping->a_ops;
+   if (a_ops->sync_page)
+   a_ops->sync_page(page);
+   } else
+   block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+   struct swap_info_struct *sis = page_swap_info(page);
+
+   if (

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Christoph Hellwig wrote:
> On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > > IMHO this shouldn't be done in the loop driver anyway.  Filesystems have
> > > their own efficient extent lookup trees (well, at least xfs and btrfs
> > > do), and we should leverage that instead of reinventing it.
> > 
> > Completely agree, it's just needed right now for this solution since all
> > we have is a crappy bmap() interface to get at those mappings.
> 
> So let's fix the interface instead of piling crap on top of it.  As I
> said I think Peter has something to start with so let's beat on it
> until we have something suitable. 

Sure, I'm all for doing it the Right Way. I wasn't aware of anything
Peter was doing in this area, so let's please see it.

It's not like opportunities to improve this haven't been around. My plan
was/is to convert to using the get_block() tricks of O_DIRECT, one could
easily argue that the work should have been done then. And perhaps
direct-io.c wouldn't be such a steaming pile of crap if it had been
done, and loop would already be fine since we could have tapped into
that.

> If we aren't done by end of Feb I'm happy to host a hackfest to get it
> sorted around the fs/storage summit..

Count me in :)

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Christoph Hellwig

On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote:
> > IMHO this shouldn't be done in the loop driver anyway.  Filesystems have
> > their own efficient extent lookup trees (well, at least xfs and btrfs
> > do), and we should leverage that instead of reinventing it.
> 
> Completely agree, it's just needed right now for this solution since all
> we have is a crappy bmap() interface to get at those mappings.

So let's fix the interface instead of piling crap on top of it.  As I
said I think Peter has something to start with so let's beat on it
until we have something suitable.   If we aren't done by end of Feb
I'm happy to host a hackfest to get it sorted around the fs/storage
summit..


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Christoph Hellwig wrote:
> > > loop maintains a prio tree of known 
> > > extents in the file (populated lazily on demand, as needed).
> > 
> > Just a quick question (I haven't looked closely at the code): how come
> > you are using a prio tree for extents? I don't think they could be
> > overlapping?
> 
> IMHO this shouldn't be done in the loop driver anyway.  Filesystems have
> their own efficient extent lookup trees (well, at least xfs and btrfs
> do), and we should leverage that instead of reinventing it.

Completely agree, it's just needed right now for this solution since all
we have is a crappy bmap() interface to get at those mappings.

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Wed, Jan 09 2008, Andi Kleen wrote:
> Jens Axboe <[EMAIL PROTECTED]> writes:
> >
> > So how does it work? Instead of punting IO to a thread and passing it
> > through the page cache, we instead attempt to send the IO directly to the
> 
> Great -- something like this was needed for a long time.
> 
> > - The file block mappings must not change while loop is using the file.
> >   This means that we have to ensure exclusive access to the file and
> >   this is the bit that is currently missing in the implementation. It
> >   would be nice if we could just do this via open(), ideas welcome...
> 
> get_write_access()/put_write_access() will block other writers.
> 
> But as pointed out by others that is not enough for this.

Yeah, basically allowing O_RDONLY | O_DIRECT opens should be ok, but we
can't allow writes and we can't allow page cache to exist for this file
outside of loop.

> I suppose you could use a white list like a special flag for file systems 
> (like ext2/ext3) that do not reallocate blocks.

Irk, but yeah we probably need something like that for now until Chris
proposes his API addition.

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Jens Axboe wrote:
> On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> > Here's the latest version of dm-loop, for comparison.
> > 
> > To try it out, 
> >   ln -s dmsetup dmlosetup
> > and supply similar basic parameters to losetup.
> > (using dmsetup version 1.02.11 or higher)
> 
> Why oh why does dm always insist on reinventing everything? That's bad
> enough in itself, but on top of that most of the extra stuff ends up
> being essentially unmaintained.
> 
> If we instead improve loop, everyone wins.
> 
> Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit
> outside your own sandbox.

So I looked at the code - it seems you build a full extent map of the blocks
in the file, filling holes as you go along. I initially did that as well,
but that is too slow to be usable in real life.

You also don't support sparse files, falling back to normal fs
read/write paths. Supporting sparse files properly is a must, people
generally don't want to prealloc a huge disk backing.

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Christoph Hellwig

On Thu, Jan 10, 2008 at 12:42:25PM +1100, Nick Piggin wrote:
> > So how does it work? Instead of punting IO to a thread and passing it
> > through the page cache, we instead attempt to send the IO directly to the
> > filesystem block that it maps to.
> 
> You told Christoph that just using direct-IO from kernel still doesn't
> give you the required behaviour... What about queueing the IO directly
> *and* using direct-IO? I guess it still has to go through the underlying
> filesystem, but that's probably a good thing.

We definitely need to go through the filesystem for I/O submission,
and also for I/O completion.  Thinking about it, async submission might be
what Peter actually implemented for his network swapping patches, as
you really wouldn't want to write it out synchronously.


Peter, any chance you could chime in here?
> 
> > loop maintains a prio tree of known 
> > extents in the file (populated lazily on demand, as needed).
> 
> Just a quick question (I haven't looked closely at the code): how come
> you are using a prio tree for extents? I don't think they could be
> overlapping?

IMHO this shouldn't be done in the loop driver anyway.  Filesystems have
their own efficient extent lookup trees (well, at least xfs and btrfs
do), and we should leverage that instead of reinventing it.

Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Thu, Jan 10 2008, Nick Piggin wrote:
> On Wednesday 09 January 2008 19:52, Jens Axboe wrote:
> 
> > So how does it work? Instead of punting IO to a thread and passing it
> > through the page cache, we instead attempt to send the IO directly to the
> > filesystem block that it maps to.
> 
> You told Christoph that just using direct-IO from kernel still doesn't
> give you the required behaviour... What about queueing the IO directly
> *and* using direct-IO? I guess it still has to go through the underlying
> filesystem, but that's probably a good thing.

If it was O_DIRECT and aio, then we would be close.

> > loop maintains a prio tree of known 
> > extents in the file (populated lazily on demand, as needed).
> 
> Just a quick question (I haven't looked closely at the code): how come
> you are using a prio tree for extents? I don't think they could be
> overlapping?

Because I'm really lazy - the core of this was basically first written
as a quick hack and then I always go shopping for reusable data
structures. prio trees fit the bill nicely, they describe extents and
allow lookup with a key anywhere in that extent. You are right in that
I don't need the overlap handling at all, and Chris already tried to
talk me into reusing his btrfs extent code :-)

So I may just do the latter, turning it into a lib/extent-map.c in the
longer run. My first priority was just having something that worked so I
could test it. At the end of the day, not a single soul would ever
notice if the prio tree ended up being slightly slower than a custom
solution.
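
Since the extents of a file cannot overlap, the "lookup with a key anywhere in that extent" that the prio tree provides can also be had from a plain sorted array. The following is a minimal sketch of that idea, not code from the patch or from btrfs; `struct extent` and `extent_find` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint64_t start; /* first file sector covered */
	uint64_t len;   /* length in sectors */
	uint64_t disk;  /* disk sector that 'start' maps to */
};

/*
 * Binary search over extents sorted by 'start' and known not to overlap.
 * Returns the extent containing 'sector', or NULL if it is not mapped.
 */
static const struct extent *extent_find(const struct extent *map, size_t n,
					uint64_t sector)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (sector < map[mid].start)
			hi = mid;               /* key lies left of this extent */
		else if (sector >= map[mid].start + map[mid].len)
			lo = mid + 1;           /* key lies right of this extent */
		else
			return &map[mid];       /* key falls inside this extent */
	}
	return NULL;
}
```

A sorted array makes insertion expensive, which matters when the map is populated lazily; a balanced tree keyed on the start sector gives the same lookup without that cost.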

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-10 Thread Jens Axboe

On Wed, Jan 09 2008, Alasdair G Kergon wrote:
> Here's the latest version of dm-loop, for comparison.
> 
> To try it out, 
>   ln -s dmsetup dmlosetup
> and supply similar basic parameters to losetup.
> (using dmsetup version 1.02.11 or higher)

Why oh why does dm always insist on reinventing everything? That's bad
enough in itself, but on top of that most of the extra stuff ends up
being essentially unmaintained.

If we instead improve loop, everyone wins.

Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit
outside your own sandbox.

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Nick Piggin

On Wednesday 09 January 2008 19:52, Jens Axboe wrote:

> So how does it work? Instead of punting IO to a thread and passing it
> through the page cache, we instead attempt to send the IO directly to the
> filesystem block that it maps to.

You told Christoph that just using direct-IO from kernel still doesn't
give you the required behaviour... What about queueing the IO directly
*and* using direct-IO? I guess it still has to go through the underlying
filesystem, but that's probably a good thing.

> loop maintains a prio tree of known 
> extents in the file (populated lazily on demand, as needed).

Just a quick question (I haven't looked closely at the code): how come
you are using a prio tree for extents? I don't think they could be
overlapping?

Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Alasdair G Kergon

On Thu, Jan 10, 2008 at 12:43:19AM +0100, [EMAIL PROTECTED] wrote:
> oh, nice to see that this is still alive.

> at least i got no crashes and was able to mount and access more than 300 
> iso-images with that.

> were there fixes/changes since then?

Little has changed for some time - mostly code cleanups and the occasional bug 
fix.

It's time to give it wider exposure, I think, and we'll find out how well it 
holds up.

Alasdair
-- 
[EMAIL PROTECTED]

Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread devzero

oh, nice to see that this is still alive.

i tried this around half a year ago because i needed more than 256 loop devices 
and iirc, this was working quite fine.
at least i got no crashes and was able to mount and access more than 300 
iso-images with that.

shortly after, loop device was extended to handle a larger number of 
loop-devices and i went that way because dm-loop was not in mainline.

i have taken a look at the wiki at http://sources.redhat.com/lvm2/wiki/DMLoop 
from time to time, but there didn't seem to happen much.

were there fixes/changes since then?

do you think it's ready for mainline?



>Here's the latest version of dm-loop, for comparison.
>
>To try it out, 
>  ln -s dmsetup dmlosetup
>and supply similar basic parameters to losetup.
>(using dmsetup version 1.02.11 or higher)
>
>Alasdair
>
>From: Bryn Reeves <[EMAIL PROTECTED]>
>
>This implements a loopback target for device mapper allowing a regular
>file to be treated as a block device.
>
>Signed-off-by: Bryn Reeves <[EMAIL PROTECTED]>
>Signed-off-by: Alasdair G Kergon <[EMAIL PROTECTED]>


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Alasdair G Kergon

Here's the latest version of dm-loop, for comparison.

To try it out, 
  ln -s dmsetup dmlosetup
and supply similar basic parameters to losetup.
(using dmsetup version 1.02.11 or higher)

Alasdair

From: Bryn Reeves <[EMAIL PROTECTED]>

This implements a loopback target for device mapper allowing a regular
file to be treated as a block device.

Signed-off-by: Bryn Reeves <[EMAIL PROTECTED]>
Signed-off-by: Alasdair G Kergon <[EMAIL PROTECTED]>

---
 drivers/md/Kconfig   |9 
 drivers/md/Makefile  |1 
 drivers/md/dm-loop.c | 1036 +++
 3 files changed, 1046 insertions(+)

Index: linux-2.6.24-rc6/drivers/md/Kconfig
===
--- linux-2.6.24-rc6.orig/drivers/md/Kconfig2008-01-07 18:32:05.0 
+
+++ linux-2.6.24-rc6/drivers/md/Kconfig 2008-01-07 18:33:12.0 +
@@ -229,6 +229,15 @@ config DM_CRYPT
 
  If unsure, say N.
 
+config DM_LOOP
+   tristate "Loop target (EXPERIMENTAL)"
+   depends on BLK_DEV_DM && EXPERIMENTAL
+   ---help---
+ This device-mapper target allows you to treat a regular file as
+ a block device.
+
+ If unsure, say N.
+
 config DM_SNAPSHOT
tristate "Snapshot target (EXPERIMENTAL)"
depends on BLK_DEV_DM && EXPERIMENTAL
Index: linux-2.6.24-rc6/drivers/md/Makefile
===
--- linux-2.6.24-rc6.orig/drivers/md/Makefile   2008-01-07 18:32:05.0 
+
+++ linux-2.6.24-rc6/drivers/md/Makefile2008-01-07 18:33:12.0 
+
@@ -34,6 +34,7 @@ obj-$(CONFIG_BLK_DEV_MD)  += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)   += dm-mod.o
 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
 obj-$(CONFIG_DM_DELAY) += dm-delay.o
+obj-$(CONFIG_DM_LOOP)  += dm-loop.o
 obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_EMC) += dm-emc.o
 obj-$(CONFIG_DM_MULTIPATH_HP)  += dm-hp-sw.o
Index: linux-2.6.24-rc6/drivers/md/dm-loop.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.24-rc6/drivers/md/dm-loop.c   2008-01-07 18:33:12.0 
+
@@ -0,0 +1,1036 @@
+/*
+ * Copyright (C) 2006-2008 Red Hat, Inc. All rights reserved.
+ *
+ * This file is part of device-mapper.
+ *
+ * Extent mapping implementation heavily influenced by mm/swapfile.c
+ * Bryn Reeves <[EMAIL PROTECTED]>
+ *
+ * File mapping and block lookup algorithms support by
+ * Heinz Mauelshagen <[EMAIL PROTECTED]>.
+ *
+ * This file is released under the GPL.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "dm.h"
+#include "dm-bio-list.h"
+
+#define DM_LOOP_DAEMON "kloopd"
+#define DM_MSG_PREFIX "loop"
+
+enum flags { DM_LOOP_BMAP, DM_LOOP_FSIO };
+
+/*
+ * Loop context
+ **/
+
+struct loop_c {
+   unsigned long flags;
+
+   /* Backing store */
+
+   struct file *filp;
+   char *path;
+   loff_t offset;
+   struct block_device *bdev;
+   unsigned blkbits;   /* file system block size shift bits */
+
+   loff_t size;/* size of entire file in bytes */
+   loff_t blocks;  /* blocks allocated to loop file */
+   sector_t mapped_sectors;/* size of mapped area in sectors */
+
+   int (*map_fn)(struct dm_target *, struct bio *);
+   void *map_data;
+};
+
+/*
+ * Block map extent
+ */
+struct dm_loop_extent {
+   sector_t start; /* start sector in mapped device */
+   sector_t to;/* start sector on target device */
+   sector_t len;   /* length in sectors */
+};
+
+/*
+ * Temporary extent list
+ */
+struct extent_list {
+   struct dm_loop_extent *extent;
+   struct list_head list;
+};
+
+static struct kmem_cache *dm_loop_extent_cache;
+
+/*
+ * Block map private context
+ */
+struct block_map_c {
+   int nr_extents; /* number of extents in map */
+   struct dm_loop_extent **map;/* linear map of extent pointers */
+   struct dm_loop_extent **mru;/* pointer to mru entry */
+   spinlock_t mru_lock;/* protects mru */
+};
+
+/*
+ * File map private context
+ */
+struct file_map_c {
+   spinlock_t lock;/* protects in */
+   struct bio_list in; /* new bios for processing */
+   struct bio_list work;   /* bios queued for processing */
+   struct workqueue_struct *wq;/* workqueue */
+   struct work_struct ws;  /* loop work */
+   struct loop_c *loop;/* for filp & offset */
+};
+
+/*
+ * Gene

Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Andi Kleen

Jens Axboe <[EMAIL PROTECTED]> writes:
>
> So how does it work? Instead of punting IO to a thread and passing it
> through the page cache, we instead attempt to send the IO directly to the

Great -- something like this was needed for a long time.

> - The file block mappings must not change while loop is using the file.
>   This means that we have to ensure exclusive access to the file and
>   this is the bit that is currently missing in the implementation. It
>   would be nice if we could just do this via open(), ideas welcome...

get_write_access()/put_write_access() will block other writers.

But as pointed out by others that is not enough for this.

I suppose you could use a white list like a special flag for file systems 
(like ext2/ext3) that do not reallocate blocks.

-Andi

Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Chris Mason

On Wed, 9 Jan 2008 10:43:21 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Wed, Jan 09 2008, Christoph Hellwig wrote:
> > On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
> > > - The file block mappings must not change while loop is using the
> > > file. This means that we have to ensure exclusive access to the
> > > file and this is the bit that is currently missing in the
> > > implementation. It would be nice if we could just do this via
> > > open(), ideas welcome...
> > 
> > And the way this is done is simply broken.  It means you have to get
> > rid of things like delayed or unwritten hands beforehand, it'll be
> > a complete pain for COW or non-block backed filesystems.
> 
> COW is not that hard to handle, you just need to be notified of moving
> blocks. If you view the patch as just a tighter integration between
> loop and fs, I don't think it's necessarily that broken.
> 
Filling holes (delayed allocation) and COW are definitely a problem.
But at least for the loop use case, most non-cow filesystems will want
to preallocate the space for the loop file and be done with it.  Sparse
loop definitely has uses, but generally those users are willing to give
up a little performance.

Jens' patch falls back to buffered writes for the hole case and
pretends cow doesn't exist.  It's a good starting point that I hope to
extend with something like the extent_map apis.
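
The hole fallback mentioned above amounts to a per-request routing decision. A minimal sketch of that decision, assuming a flat block map in which 0 means "hole" (the value bmap() reports for an unmapped block); `choose_path` and the enum are hypothetical names and this is not taken from the patch:

```c
#include <stdint.h>

enum io_path { IO_DIRECT, IO_BUFFERED };

/*
 * blockmap[i] is the disk block backing file block i, 0 meaning a hole.
 * A request may bypass the page cache only if every block it touches is
 * already allocated; otherwise the filesystem has to allocate first.
 */
static enum io_path choose_path(const uint64_t *blockmap, uint64_t nblocks,
				uint64_t first, uint64_t count)
{
	for (uint64_t b = first; b < first + count; b++) {
		if (b >= nblocks || blockmap[b] == 0)
			return IO_BUFFERED; /* hole or past EOF */
	}
	return IO_DIRECT;
}
```

On a preallocated file every request takes the direct path; on a sparse one only the writes that land in holes pay for the buffered path.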

> I did consider these cases, and it can be done with the existing
> approach.
> 
> > The right way to do this is to allow direct I/O from kernel sources
> > where the filesystem is in-charge of submitting the actual I/O after
> > the pages are handed to it.  I think Peter Zijlstra has been looking
> > into something like that for swap over nfs.
> 
> That does sound like a nice approach, but a lot more work. It'll
> behave differently too, the advantage of what I proposed is that it
> behaves like a real device.

The problem with O_DIRECT (or even O_SYNC) loop is that every write
into loop becomes synchronous, and it really changes the performance of
things like filemap_fdatawrite.

If we just hand ownership of the file over to loop entirely and prevent
other openers (perhaps even forcing backups through the loop device),
we get fewer corner cases and much better performance.

-chris

Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Jens Axboe

On Wed, Jan 09 2008, Christoph Hellwig wrote:
> On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
> > - The file block mappings must not change while loop is using the file.
> >   This means that we have to ensure exclusive access to the file and
> >   this is the bit that is currently missing in the implementation. It
> >   would be nice if we could just do this via open(), ideas welcome...
> 
> And the way this is done is simply broken.  It means you have to get
> rid of things like delayed or unwritten hands beforehand, it'll be
> a complete pain for COW or non-block backed filesystems.

COW is not that hard to handle, you just need to be notified of moving
blocks. If you view the patch as just a tighter integration between loop
and fs, I don't think it's necessarily that broken.

I did consider these cases, and it can be done with the existing
approach.

> The right way to do this is to allow direct I/O from kernel sources
> where the filesystem is in-charge of submitting the actual I/O after
> the pages are handed to it.  I think Peter Zijlstra has been looking
> into something like that for swap over nfs.

That does sound like a nice approach, but a lot more work. It'll behave
differently too, the advantage of what I proposed is that it behaves
like a real device.

I'm not asking you to love it (in fact I knew some people would complain
about this approach and I understand why), just tossing it out there to
get things rolling. If we end up doing it differently I don't really
care, I'm not married to any solution but merely wish to solve a
problem. If that ends up being solved differently, the outcome is the
same to me.

-- 
Jens Axboe


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Christoph Hellwig

On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
> - The file block mappings must not change while loop is using the file.
>   This means that we have to ensure exclusive access to the file and
>   this is the bit that is currently missing in the implementation. It
>   would be nice if we could just do this via open(), ideas welcome...

And the way this is done is simply broken.  It means you have to get
rid of things like delayed or unwritten hands beforehand, it'll be
a complete pain for COW or non-block backed filesystems.

The right way to do this is to allow direct I/O from kernel sources
where the filesystem is in-charge of submitting the actual I/O after
the pages are handed to it.  I think Peter Zijlstra has been looking
into something like that for swap over nfs.


[PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Jens Axboe

Hi,

loop.c currently uses the page cache interface to do IO to file backed
devices. This works reasonably well for simple things, like mapping an
iso9660 file for direct mount and other read-only workloads. Writing is
somewhat problematic, as anyone who has really used this feature can
attest to - it tends to confuse the vm (hello kswapd) since it breaks
dirty accounting and behaves very erratically on writeout. Did I mention
that it's pretty slow as well, for both reads and writes?

It also behaves differently than a real drive. For writes, completions
are done once they hit page cache. Since loop queues bio's async and
hands them off to a thread, you can have a huge backlog of stuff to do.
It's hard to guarantee data safety for file systems on top of
loop without making it even slower than it currently is.

Back when loop was only used for iso9660 mounting and other simple
things, this didn't matter. Now it's often used in xen (and others)
setups where we do care about performance AND writing. So the below is an
attempt at speeding up loop and making it behave like a real device.
It's a somewhat quick hack and is still missing one piece to be
complete, but I'll throw it out there for people to play with and
comment on.

So how does it work? Instead of punting IO to a thread and passing it
through the page cache, we instead attempt to send the IO directly to the
filesystem block that it maps to. loop maintains a prio tree of known
extents in the file (populated lazily on demand, as needed). Advantages
of this approach:

- It's fast, loop will basically work at device speed.
- It's light on the system: loop doesn't generate a huge amount of load
  when busy. When I did comparison tests on my notebook with an
  external drive, running a simple tiobench on the current in-kernel
  loop with a sparse file backing rendered the notebook basically
  unusable while the test was ongoing. The remapper version had no more
  impact than running directly on the external drive.
- It behaves like a real block device.
- It's easy to support IO barriers, which is needed to ensure safety
  especially in virtualized setups.

Disadvantages:

- The file block mappings must not change while loop is using the file.
  This means that we have to ensure exclusive access to the file and
  this is the bit that is currently missing in the implementation. It
  would be nice if we could just do this via open(), ideas welcome...
- It'll tie down a bit of memory for the prio tree. This is GREATLY
  offset by the reduced page cache foot print though.
- It cannot be used with the loop encryption stuff. dm-crypt should be
  used instead, on top of loop (which, I think, is even the recommended
  way to do this today, so not a big deal).
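
The remapping step at the heart of this approach can be sketched as follows: given a known extent, translate a file-relative sector into a sector on the underlying device and clamp the request at the extent boundary. This is an illustration only; `struct loop_extent`, `struct remapped` and `remap_io` are hypothetical names, not the structures used in the patch.

```c
#include <stdint.h>

struct loop_extent {
	uint64_t file_start; /* first sector of the extent within the file */
	uint64_t disk_start; /* sector on the underlying device it maps to */
	uint64_t len;        /* extent length in sectors */
};

struct remapped {
	uint64_t disk_sector; /* where the I/O goes on the underlying device */
	uint64_t nr;          /* sectors that can be issued in one go */
};

/*
 * Translate an I/O of 'nr' sectors starting at file sector 'sector'.
 * Returns 0 on success, -1 if 'sector' lies outside the extent. If the
 * request crosses the end of the extent, out->nr is shortened and the
 * caller has to map the remainder separately.
 */
static int remap_io(const struct loop_extent *ex, uint64_t sector,
		    uint64_t nr, struct remapped *out)
{
	if (sector < ex->file_start || sector - ex->file_start >= ex->len)
		return -1;

	uint64_t off = sector - ex->file_start; /* offset into the extent */
	uint64_t room = ex->len - off;          /* sectors left in the extent */

	out->disk_sector = ex->disk_start + off;
	out->nr = nr < room ? nr : room;
	return 0;
}
```

This only stays valid while the block mapping is stable, which is why the first disadvantage above (exclusive access to the file) matters.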

This patch will automatically enable the new operation mode (called
fastfs). I added an ioctl (LOOP_SET_FASTFS) that should be implemented
in losetup, then we can remove this hunk in the code:

+   /*
+* This needs to be done after setup with another ioctl,
+* not automatically like this.
+*/
+   loop_init_fastfs(lo);
+

from loop_set_fd().

Patch is against 2.6.23-rc7 ('ish, as of this morning), will probably
apply easily to 2.6.22 as well.


diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 56e2304..e49bfa8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -481,16 +481,51 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
return ret;
 }
 
+#define __lo_throttle(wq, lock, condition)                         \
+do {                                                               \
+   DEFINE_WAIT(__wait);                                            \
+   for (;;) {                                                      \
+      prepare_to_wait((wq), &__wait, TASK_UNINTERRUPTIBLE);        \
+      if (condition)                                               \
+         break;                                                    \
+      spin_unlock_irq((lock));                                     \
+      io_schedule();                                               \
+      spin_lock_irq((lock));                                       \
+   }                                                               \
+   finish_wait((wq), &__wait);                                     \
+} while (0)
+
+#define lo_act_bio(bio)   ((bio)->bi_bdev)
+#define LO_BIO_THROTTLE   128
+
 /*
- * Add bio to back of pending list
+ * A normal block device will throttle on request allocation. Do the same
+ * for loop to prevent millions of bio's queued internally.
+ */
+static void loop_bio_throttle(struct loop_device *lo, struct bio *bio)
+{
+   if (lo_act_bio(bio))
+   __lo_throttle(&lo->lo_bio_wait, &lo->lo_lock,
+   lo->lo_bio_cnt < LO_BIO_THROTTLE);
+}
+
+/*
+ * Add bio
