Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Christoph Hellwig
On Sat, Dec 01, 2012 at 09:40:41AM +1100, Dave Chinner wrote:
> So it was based on this interface?

Based on that.  I had dropped the inode operation as it's not really
a generic operation but a callback for either the buffered I/O code
or direct I/O and should be treated as such.  I've also split the
single multiplexer function into individual ones, but the underlying
data structure and fundamental operations are the same.

> (I went looking for this code on google a couple of days ago so I
> could point at it and say "we should be using an iomap structure,
> not buffer heads", but it looks like I never posted it to fsdevel or
> the xfs lists...)

Your version defintively was up on your kernel.org XFS tree, that's what
I started from.

I'll have a long plane right tonight, let's see if I can get the direct
I/O version updated.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Dave Chinner
On Fri, Nov 30, 2012 at 11:36:01AM -0500, Christoph Hellwig wrote:
> On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote:
> > > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > > the block number is very much defined to be in multiples of b_size
> > > (see for example "submit_bh()" that turns it into a sector number).
> > > 
> > > But you're right. The direct-IO code really *is* violating that, and
> > > knows that get_block() ends up being defined in i_blkbits regardless
> > > of b_size.
> > 
> > Same with mpage_readpages(), so it's not just direct IO that has
> > this problem
> 
> The mpage code may actually fall back to BHs.
> 
> I have a version of the direct I/O code that uses the iomap_ops from the
> multi-page write code that you originally started.  It uses the new op
> as primary interface for direct I/O and provides a helper for
> filesystems that still use buffer heads internally.  I'll try to dust it
> off and send out a version for the current kernel.

So it was based on this interface?

(I went looking for this code on google a couple of days ago so I
could point at it and say "we should be using an iomap structure,
not buffer heads", but it looks like I never posted it to fsdevel or
the xfs lists...)


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 090f0ea..e247d62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -522,6 +522,7 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+struct iomap;
 
 struct iov_iter {
const struct iovec *iov;
@@ -614,6 +615,9 @@ struct address_space_operations {
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
+
+   int (*iomap)(struct address_space *mapping, loff_t pos,
+   ssize_t length, struct iomap *iomap, int cmd);
 };
 
 /*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
new file mode 100644
index 000..7708614
--- /dev/null
+++ b/include/linux/iomap.h
@@ -0,0 +1,45 @@
+#ifndef _IOMAP_H
+#define _IOMAP_H
+
+/* ->iomap a_op command types */
+#define IOMAP_READ 0x01/* read the current mapping starting at the
+  given position, trimmed to a maximum length.
+  FS's should use this to obtain and lock
+  resources within this range */
+#define IOMAP_RESERVE  0x02/* reserve space for an allocation that spans
+  the given iomap */
+#define IOMAP_ALLOCATE 0x03/* allocate space in a given iomap - must have
+  first been reserved */
+#define IOMAP_UNRESERVE0x04/* return unused reserved space for the 
given
+  iomap and used space. This will always be
+  called after a IOMAP_READ so as to allow the
+  FS to release held resources. */
+
+/* types of block ranges for multipage write mappings. */
+#define IOMAP_HOLE 0x01/* no blocks allocated, need allocation */
+#define IOMAP_DELALLOC 0x02/* delayed allocation blocks */
+#define IOMAP_MAPPED   0x03/* blocks allocated @blkno */
+#define IOMAP_UNWRITTEN0x04/* blocks allocated @blkno in unwritten 
state */
+
+struct iomap {
+   sector_tblkno;  /* first sector of mapping */
+   loff_t  offset; /* file offset of mapping, bytes */
+   ssize_t length; /* length of mapping, bytes */
+   int type;   /* type of mapping */
+   void*priv;  /* fs private data associated with map */
+};
+
+static inline bool
+iomap_needs_allocation(struct iomap *iomap)
+{
+   return iomap->type == IOMAP_HOLE;
+}
+
+/* multipage write interfaces use iomaps */
+typedef int (*mpw_actor_t)(struct address_space *mapping, void *src,
+   loff_t pos, ssize_t len, struct iomap *iomap);
+
+ssize_t multipage_write_segment(struct address_space *mapping, void *src,
+   loff_t pos, ssize_t length, mpw_actor_t actor);
+
+#endif /* _IOMAP_H */

Cheers,

Dave.
> 
> 
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
> 
> 

-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Linus Torvalds
On Fri, Nov 30, 2012 at 6:31 AM, Chris Mason  wrote:
> On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
>>
>> Same with mpage_readpages(), so it's not just direct IO that has
>> this problem
>
> I guess the good news is that block devices don't have readpages.  The
> bad news would be that we can't put readpages in without much bigger
> changes.

Well, the new block-dev branch no longer cares. It basically says "ok,
we use inode->i_blkbits, but for raw device accesses we know it's
unstable, so we'll just not use it".

So both mpage_readpages and direct-IO should be happy. And it actually
removed redundant code, so it's all good.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Christoph Hellwig
On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote:
> > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > the block number is very much defined to be in multiples of b_size
> > (see for example "submit_bh()" that turns it into a sector number).
> > 
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
> 
> Same with mpage_readpages(), so it's not just direct IO that has
> this problem

The mpage code may actually fall back to BHs.

I have a version of the direct I/O code that uses the iomap_ops from the
multi-page write code that you originally started.  It uses the new op
as primary interface for direct I/O and provides a helper for
filesystems that still use buffer heads internally.  I'll try to dust it
off and send out a version for the current kernel.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-30 Thread Chris Mason
On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote:
> On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> > On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason  
> > wrote:
> > >
> > > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > > shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
> > > it has no relation at all to the actual block size of the device.  The
> > > interface is abusing b_size to ask for as large a mapping as possible.
> > 
> > Ugh. That's a big violation of how buffer-heads are supposed to work:
> > the block number is very much defined to be in multiples of b_size
> > (see for example "submit_bh()" that turns it into a sector number).
> > 
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
> 
> Same with mpage_readpages(), so it's not just direct IO that has
> this problem

I guess the good news is that block devices don't have readpages.  The
bad news would be that we can't put readpages in without much bigger
changes.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Dave Chinner
On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason  wrote:
> >
> > Just reading the new blkdev_get_blocks, it looks like we're mixing
> > shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
> > it has no relation at all to the actual block size of the device.  The
> > interface is abusing b_size to ask for as large a mapping as possible.
> 
> Ugh. That's a big violation of how buffer-heads are supposed to work:
> the block number is very much defined to be in multiples of b_size
> (see for example "submit_bh()" that turns it into a sector number).
> 
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.

Same with mpage_readpages(), so it's not just direct IO that has
this problem

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 07:13:02PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason  wrote:
> >
> > I searched through filemap.c for the magic i_size check that would let
> > us get away with ignoring i_blkbits in get_blocks, but its just not
> > there.  The whole fallback-to-buffered scheme seems to rely on
> > get_blocks checking for i_size.  I really hope I'm just missing
> > something.
> 
> So generic_write_checks() limits the size to i_size at for writes (and
> for "isblk").

Great, that's what I was missing.

> 
> Sure, then it will do the buffered part after that, but that should
> all be fine anyway, since by then we use the normal page cache.
> 
> For reads, generic_file_aio_read() will check pos < size, but doesn't
> seem to actually limit the size of the iovec.

I couldn't explain that either.

> 
> I'm not sure why it doesn't just do "iov_shorten()".
> 
> Anyway, having looked at actually passing in the block size to
> get_block(), I can say that is a horrible idea. There are tons of
> get_block functions (for various filesystems), and *none* of them
> really want the block size, because they tend to work on block
> indexes. And if they do want the block size, they'll just get it from
> the inode or sb, since they are filesystems and it's all stable.
> 
> So the *only* of the places that would want the block size is
> fs/block_dev.c. And the callers really already seem to do the i_size
> check, although they sometimes do it badly. And since there are fewer
> callers than there are get_block() implementations, I think we should
> just fix the callers and be done with it.
> 
> Those generic_file_aio_read/write() functions in fs/direct-io.c really
> just seem to be badly written. The fact that they may depend on the
> i_size check in get_blocks() is sad, but I think we should fix it and
> just remove the check for block devices. That's going to simplify so
> much..
> 
> I updated the 'block-dev' branch to have that simpler fs/block_dev.c
> model instead. I'll look at the iovec shortening later. It's a
> non-fast-forward thing, look out!
> 
> (I actually think we should just add the max-offset check to
> rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
> and we could make it do a max_offset check as well).

This is definitely easier, and I can't see any reason not to do it.  I'm
used to get_block being expensive and so it didn't even cross my mind.

We can benchmark things just to make sure.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason  wrote:
>
> I searched through filemap.c for the magic i_size check that would let
> us get away with ignoring i_blkbits in get_blocks, but its just not
> there.  The whole fallback-to-buffered scheme seems to rely on
> get_blocks checking for i_size.  I really hope I'm just missing
> something.

So generic_write_checks() limits the size to i_size at for writes (and
for "isblk").

Sure, then it will do the buffered part after that, but that should
all be fine anyway, since by then we use the normal page cache.

For reads, generic_file_aio_read() will check pos < size, but doesn't
seem to actually limit the size of the iovec.

I'm not sure why it doesn't just do "iov_shorten()".

Anyway, having looked at actually passing in the block size to
get_block(), I can say that is a horrible idea. There are tons of
get_block functions (for various filesystems), and *none* of them
really want the block size, because they tend to work on block
indexes. And if they do want the block size, they'll just get it from
the inode or sb, since they are filesystems and it's all stable.

So the *only* of the places that would want the block size is
fs/block_dev.c. And the callers really already seem to do the i_size
check, although they sometimes do it badly. And since there are fewer
callers than there are get_block() implementations, I think we should
just fix the callers and be done with it.

Those generic_file_aio_read/write() functions in fs/direct-io.c really
just seem to be badly written. The fact that they may depend on the
i_size check in get_blocks() is sad, but I think we should fix it and
just remove the check for block devices. That's going to simplify so
much..

I updated the 'block-dev' branch to have that simpler fs/block_dev.c
model instead. I'll look at the iovec shortening later. It's a
non-fast-forward thing, look out!

(I actually think we should just add the max-offset check to
rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing,
and we could make it do a max_offset check as well).

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 03:36:38PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
>  wrote:
> >
> > But you're right. The direct-IO code really *is* violating that, and
> > knows that get_block() ends up being defined in i_blkbits regardless
> > of b_size.
> 
> It turns out fs/ioctl.c does the same - it fills in the buffer head
> with some random bh->b_size too. I think it's not even a power of two
> in that case.
> 
> And I guess it's understandable - they don't actually *use* the
> buffer, they just want the offset. So the b_size field really is just
> random crap to the users of the get_block interfaces, since they've
> never cared before.
> 
> Ugh, this was definitely a dark and disgusting underbelly of the VFS
> layer. We've not had to really touch it for a *looong* time..

I searched through filemap.c for the magic i_size check that would let
us get away with ignoring i_blkbits in get_blocks, but its just not
there.  The whole fallback-to-buffered scheme seems to rely on
get_blocks checking for i_size.  I really hope I'm just missing
something.

If we're going to change this, I'd vote for something non-bh based.  I
didn't check every single FS, but I don't think direct-IO really wants
or needs buffer heads at all.

One less wart in direct-io.c would really be nice, but I'm assuming
it'll take us at least one full release to hammer out a shiny new
get_blocks.  Passing i_blkbits would be more mechanical, since all the
filesystems would just ignore it.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds
 wrote:
>
> But you're right. The direct-IO code really *is* violating that, and
> knows that get_block() ends up being defined in i_blkbits regardless
> of b_size.

It turns out fs/ioctl.c does the same - it fills in the buffer head
with some random bh->b_size too. I think it's not even a power of two
in that case.

And I guess it's understandable - they don't actually *use* the
buffer, they just want the offset. So the b_size field really is just
random crap to the users of the get_block interfaces, since they've
never cared before.

Ugh, this was definitely a dark and disgusting underbelly of the VFS
layer. We've not had to really touch it for a *looong* time..

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason  wrote:
>
> Just reading the new blkdev_get_blocks, it looks like we're mixing
> shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
> it has no relation at all to the actual block size of the device.  The
> interface is abusing b_size to ask for as large a mapping as possible.

Ugh. That's a big violation of how buffer-heads are supposed to work:
the block number is very much defined to be in multiples of b_size
(see for example "submit_bh()" that turns it into a sector number).

But you're right. The direct-IO code really *is* violating that, and
knows that get_block() ends up being defined in i_blkbits regardless
of b_size.

What a crock. That direct-IO code is hack-upon-hack. Whoever wrote it
should be shot.

I think the only sane way to fix is is to pass in the block size to
get_blocks(). Which we admittedly should have done long ago, so that's
not a bad fix, but without actually looking at what it involves, I
think it's going to be pretty big patch. All the filesystems that
support the interface need to update it, even if they can then ignore
it, because direct-IO does all these hacks only for the raw device.

And I think it will improve the interface, but damn, direct-IO is
still horrible for playing these kinds of games.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 01:52:22PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason  
> wrote:
> >
> > It was all a trick to get you to say the AIO code was sane.
> 
> It's only sane compared to the DIO code.
> 
> That said, I hate AIO much less these days that we've largely merged
> the code with the regular IO. It's still a horrible interface, but at
> least it is no longer a really disgusting separate implementation in
> the kernel of that horrible interface.
> 
> So yeah, I guess AIO really is pretty sane these days.
> 
> > It looks like we could use the private copy of i_blkbits that DIO is
> > already recording.
> 
> Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.
> 
> I've pushed out two more commits to the 'block-dev' branch at
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev
> 
> in case anybody wants to take a look.
> 
> It is - as usual - entirely untested. It compiles, and I *think* that
> blkdev_get_blocks() makes a whole lot more sense this way - as you
> said, it should be byte-based (although it actually does the block
> number conversion because I worried about overflow - probably
> unnecessarily).
> 
> Comments?

Your blkdev_get_blocks emails were great reading while at the dentist,
thanks for helping me pass the time.

Just reading the new blkdev_get_blocks, it looks like we're mixing
shifts.  In direct-io.c map_bh->b_size is how much we'd like to map, and
it has no relation at all to the actual block size of the device.  The
interface is abusing b_size to ask for as large a mapping as possible.

Most importantly, it has no relation to the fs_startblk that we pass in,
which is based on inode->i_blkbits.

So your new check in blkdev_get_blocks:

   if (iblock >= end_block) {

Is wrong because iblock and end_block are based on different sizes.  I
think we have to do the eof checks inside fs/direct-io.c or change the
get_blocks interface completely.

I really thought fs/direct-io.c was already doing eof checks, but I'm
reading harder.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason  wrote:
>
> It was all a trick to get you to say the AIO code was sane.

It's only sane compared to the DIO code.

That said, I hate AIO much less these days that we've largely merged
the code with the regular IO. It's still a horrible interface, but at
least it is no longer a really disgusting separate implementation in
the kernel of that horrible interface.

So yeah, I guess AIO really is pretty sane these days.

> It looks like we could use the private copy of i_blkbits that DIO is
> already recording.

Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out.

I've pushed out two more commits to the 'block-dev' branch at

  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev

in case anybody wants to take a look.

It is - as usual - entirely untested. It compiles, and I *think* that
blkdev_get_blocks() makes a whole lot more sense this way - as you
said, it should be byte-based (although it actually does the block
number conversion because I worried about overflow - probably
unnecessarily).

Comments?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:55 AM, Linus Torvalds
 wrote:
>
> The blkdev_get_blocks() this is just sad.
>
> The underlying data structure is actually byte-based (it's
> "i_size_read(bdev->bd_inode"), but we convert it to a block-based
> number.
>
> Oops.

Oh, it's even worse than that. The DIO code ends up passing in buffer
heads that have sizes bigger than the inode i_blksize, which can cause
problems at the end of the disk. So blkdev_get_blocks() knows about
it, and will then "fix" that and shrink them down. The games with
"max_block" are hilarious.

In a really sad way.

That whole blkdev_get_blocks() function is pure and utter shit.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason  wrote:
>
> blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
> really don't get why that isn't byte based instead.  DIO is already
> doing the shift & mask game.

The blkdev_get_blocks() this is just sad.

The underlying data structure is actually byte-based (it's
"i_size_read(bdev->bd_inode"), but we convert it to a block-based
number.

Oops.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:26 AM, Linus Torvalds
 wrote:
>
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).

Oh wow.

The direct-IO code really doesn't seem to care at all. I don't think
it needs locking either (it seems to do everything with a private
buffer-head), and the problem appears solely to be that it reads
i_blksize multiple times, so changing it just happens to confuse the
direct-io code.

If it were to read it only once, and then use that value, it looks
like it should all JustWork(tm).

And the right thing to do would seem to just add it to the
"dio_submit" structure, that we already have. And it already *has* a
blkbits field, but that's the "IO blocksize", not the "getblocks
blocksize", if I read that mess correctly.

Of course, it then *ALREADY* has that "blkfactor" thing, which is the
difference between i_blkbits and blktbits, so it effective *does* have
i_blkbits already in the dio_submit structure. But despite it all, it
keeps re-reading i_blksize.

Christ. That code is a mess.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 12:26:06PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason  
> wrote:
> >
> > The fs/buffer.c part makes sense during a quick read.  But
> > fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
> > bugs there.
> 
> Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
> incestuously playing games that should be in fs/buffer.c. I thought it
> was doing the sane thing with the page cache.
> 
> (I now realize that Mikulas was talking about this mess, while I
> thought he was talking about the AIO code which is largely sane).

It was all a trick to get you to say the AIO code was sane.

It looks like we could use the private copy of i_blkbits that DIO is
already recording.

blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I
really don't get why that isn't byte based instead.  DIO is already
doing the shift & mask game.

I think only clean_blockdev_aliases is intentionally using the inode's
i_blkbits, but again that shouldn't be changing for filesystems so it
seems safe to use the DIO copy.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason  wrote:
>
> The fs/buffer.c part makes sense during a quick read.  But
> fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
> bugs there.

Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be
incestuously playing games that should be in fs/buffer.c. I thought it
was doing the sane thing with the page cache.

(I now realize that Mikulas was talking about this mess, while I
thought he was talking about the AIO code which is largely sane).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 12:02:17PM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
>  wrote:
> >
> > I think I'll apply this for 3.7 (since it's too late to do anything
> > fancier), and then for 3.8 I will rip out all the locking entirely,
> > because looking at the fs/buffer.c patch I wrote up, it's all totally
> > unnecessary.
> >
> > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> > creating new buffers) simply makes the whole locking thing pointless.
> > Just make the page lock protect the block size, and make it per-page,
> > and we're done.
> 
> There's a 'block-dev' branch in my git tree, if you guys want to play
> around with it.
> 
> It actually reverts fs/block-dev.c back to the 3.6 state (except for
> some whitespace damage that I refused to re-introduce), so that part
> of the changes should be pretty safe and well tested.
> 
> The fs/buffer.c changes, of course, are new. It's largely the same
> patch I already sent out, with a small helper function to simplify it,
> and to keep the whole ACCESS_ONCE() thing in just a single place.

The fs/buffer.c part makes sense during a quick read.  But
fs/direct-io.c plays with i_blkbits too.  The semaphore was fixing real
bugs there.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds
 wrote:
>
> I think I'll apply this for 3.7 (since it's too late to do anything
> fancier), and then for 3.8 I will rip out all the locking entirely,
> because looking at the fs/buffer.c patch I wrote up, it's all totally
> unnecessary.
>
> Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> creating new buffers) simply makes the whole locking thing pointless.
> Just make the page lock protect the block size, and make it per-page,
> and we're done.

There's a 'block-dev' branch in my git tree, if you guys want to play
around with it.

It actually reverts fs/block-dev.c back to the 3.6 state (except for
some whitespace damage that I refused to re-introduce), so that part
of the changes should be pretty safe and well tested.

The fs/buffer.c changes, of course, are new. It's largely the same
patch I already sent out, with a small helper function to simplify it,
and to keep the whole ACCESS_ONCE() thing in just a single place.

That branch may be re-based in case I get reports or acks or whatever,
so don't rely on it (or if you do, please let me know, and I'll stop
editing it).

The fact that I could just revert the fs/block-dev.c part to a known
state makes me wonder if this might be safe for 3.7 after all (the
fs/buffer.c changes all *look* safe). That way we'd not have to worry
about any new semantics (whether they be EBUSY or any possible locking
slowdowns or RT issues). But I'll think about it, and it would be good
for people to double-check my fs/buffer.c stuff.

Mikulas, does that pass your testing?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 10:23 AM, Mikulas Patocka  wrote:
>
>
> If you remove that percpu rw lock, you also need to rewrite direct i/o
> code.
>
> In theory, block device direct i/o doesn't need buffer block size at all.
> But in practice, it shares a lot of code with filesystem direct i/o, it
> reads the block size multiple times and it crashes if it changes.

If it's a filesystem, then the size will never change while it is mounted.

So only the direct-block-device case needs to be worried about, no?

And that uses __generic_file_aio_write() and friends, which in turn
use the readpage/writepage functions.

So for block devices, it should be sufficient to make
readpage/writepage (with the writing obviously having all the
"write_begin/write_end/fullpage" variants) be safe as far as I can
see.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Mikulas Patocka


On Thu, 29 Nov 2012, Linus Torvalds wrote:

> On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka  wrote:
> >
> > This sounds sensible. I'm sending this patch.
> 
> This looks much better.
> 
> I think I'll apply this for 3.7 (since it's too late to do anything
> fancier), and then for 3.8 I will rip out all the locking entirely,
> because looking at the fs/buffer.c patch I wrote up, it's all totally
> unnecessary.
> 
> Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
> creating new buffers) simply makes the whole locking thing pointless.
> Just make the page lock protect the block size, and make it per-page,
> and we're done.
> 
> No RCU grace period crap, no expedited mess, no nothing.
> 
>   Linus

Yes.

If you remove that percpu rw lock, you also need to rewrite direct i/o 
code.

In theory, block device direct i/o doesn't need buffer block size at all. 
But in practice, it shares a lot of code with filesystem direct i/o, it 
reads the block size multiple times and it crashes if it changes.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Do a proper locking for mmap and block size change

2012-11-29 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka  wrote:
>
> This sounds sensible. I'm sending this patch.

This looks much better.

I think I'll apply this for 3.7 (since it's too late to do anything
fancier), and then for 3.8 I will rip out all the locking entirely,
because looking at the fs/buffer.c patch I wrote up, it's all totally
unnecessary.

Adding a ACCESS_ONCE() to the read of the i_blkbits value (when
creating new buffers) simply makes the whole locking thing pointless.
Just make the page lock protect the block size, and make it per-page,
and we're done.

No RCU grace period crap, no expedited mess, no nothing.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] Do a proper locking for mmap and block size change

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Al Viro wrote:

> On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
> > No, this is crap.
> > 
> > We don't introduce random hooks like this just because the block layer
> > has shit-for-brains and cannot be bothered to do things right.
> > 
> > The fact is, the whole locking in the block layer open routine is
> > total and utter crap. It doesn't lock the right thing, even with your
> > change *anyway* (or with the change Jens had). Absolutely nothing in
> > "mmap_region()" cares at all about the block-size anywhere - it's
> > generic, after all - so locking around it is f*cking pointless. There
> > is no way in hell that the caller of ->mmap can *ever* care about the
> > block size, since it never even looks at it.
> > 
> > Don't do random crap like this.
> > 
> > Why does the code think that mmap matters so much anyway? As you say,
> > the mmap itself does *nothing*. It has no impact for the block size.
> 
> ... and here I was, trying to massage a reply into form that would be
> at least borderline printable...  Anyway, this is certainly the wrong
> way to go.  We want to catch if the damn thing is mapped and mapping_mapped()
> leaves a race?  Fine, so let's not use mapping_mapped() at all.  Have a
> private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
> added to it.  Incrementing/decrementing a per-block_device atomic counter,
> with increment side done under your rwsem, to make sure that 0->positive
> transition doesn't change in critical section of set_blocksize().  And don't
> use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that
> local copy.

This sounds sensible. I'm sending this patch.

Mikulas

---

Do a proper locking for mmap and block size change

For block devices, we must prevent the device from being mapped while
block size is being changed.

This was implemented by catching the mmap method and taking
bd_block_size_semaphore in it.

The problem is that the generic_file_mmap method does almost nothing, it
only sets vma->vm_ops = &generic_file_vm_ops, all the hard work is done
by the caller, mmap_region. Consequently, locking the mmap method is
insufficient, to prevent the race condition. The race condition can
happen for example this way:

process 1:
Starts mapping a block device, it goes to mmap_region, calls
file->f_op->mmap (blkdev_mmap - that acquires and releases
bd_block_size_semaphore), and reschedule happens just after blkdev_mmap
returns.

process 2:
Changes block device size, goes into set_blocksize, acquires
percpu_down_write, calls mapping_mapped to test if the device is mapped
(it still isn't). Then, it reschedules.

process 1:
Continues in mmap_region, successfully finishes the mapping.

process 2:
Continues in set_blocksize, changes the block size while the block
device is mapped. (which is incorrect and may result in a crash or
misbehavior).

To fix this possible race condition, we introduce a counter
bd_mmap_count that counts the number of vmas that maps the block device.
bd_mmap_count is incremented in blkdev_mmap and in blkdev_vm_open, it is
decremented in blkdev_vm_close. We don't allow changing block size while
bd_mmap_count is non-zero.

Signed-off-by: Mikulas Patocka 

---
 fs/block_dev.c |   49 -
 include/linux/fs.h |3 +++
 2 files changed, 35 insertions(+), 17 deletions(-)

Index: linux-3.7-rc7/fs/block_dev.c
===
--- linux-3.7-rc7.orig/fs/block_dev.c   2012-11-28 21:13:12.0 +0100
+++ linux-3.7-rc7/fs/block_dev.c2012-11-28 21:33:23.0 +0100
@@ -35,6 +35,7 @@ struct bdev_inode {
struct inode vfs_inode;
 };
 
+static const struct vm_operations_struct def_blk_vm_ops;
 static const struct address_space_operations def_blk_aops;
 
 static inline struct bdev_inode *BDEV_I(struct inode *inode)
@@ -114,16 +115,6 @@ void invalidate_bdev(struct block_device
 }
 EXPORT_SYMBOL(invalidate_bdev);
 
-static int is_bdev_mapped(struct block_device *bdev)
-{
-   int ret_val;
-   struct address_space *mapping = bdev->bd_inode->i_mapping;
-   mutex_lock(&mapping->i_mmap_mutex);
-   ret_val = mapping_mapped(mapping);
-   mutex_unlock(&mapping->i_mmap_mutex);
-   return ret_val;
-}
-
 int set_blocksize(struct block_device *bdev, int size)
 {
/* Size must be a power of two, and between 512 and PAGE_SIZE */
@@ -136,16 +127,16 @@ int set_blocksize(struct block_device *b
 
/*
 * If the block size doesn't change, don't take the write lock.
-* We check for is_bdev_mapped anyway, for consistent behavior.
+* We check for bd_mmap_count anyway, for consistent behavior.
 */
if (size == bdev->bd_block_size)
-   return is_bdev_mapped(bdev) ? -EBUSY : 0;
+   return atomic_read(&bdev->bd_mmap_count) ? -EBUSY : 0;
 
/* Prevent starting I/O or mapping the device */