Re: [PATCH v2] Do a proper locking for mmap and block size change
On Sat, Dec 01, 2012 at 09:40:41AM +1100, Dave Chinner wrote: > So it was based on this interface? Based on that. I had dropped the inode operation as it's not really a generic operation but a callback for either the buffered I/O code or direct I/O and should be treated as such. I've also split the single multiplexer function into individual ones, but the underlying data structure and fundamental operations are the same. > (I went looking for this code on google a couple of days ago so I > could point at it and say "we should be using an iomap structure, > not buffer heads", but it looks like I never posted it to fsdevel or > the xfs lists...) Your version defintively was up on your kernel.org XFS tree, that's what I started from. I'll have a long plane right tonight, let's see if I can get the direct I/O version updated. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Fri, Nov 30, 2012 at 11:36:01AM -0500, Christoph Hellwig wrote: > On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote: > > > Ugh. That's a big violation of how buffer-heads are supposed to work: > > > the block number is very much defined to be in multiples of b_size > > > (see for example "submit_bh()" that turns it into a sector number). > > > > > > But you're right. The direct-IO code really *is* violating that, and > > > knows that get_block() ends up being defined in i_blkbits regardless > > > of b_size. > > > > Same with mpage_readpages(), so it's not just direct IO that has > > this problem > > The mpage code may actually fall back to BHs. > > I have a version of the direct I/O code that uses the iomap_ops from the > multi-page write code that you originally started. It uses the new op > as primary interface for direct I/O and provides a helper for > filesystems that still use buffer heads internally. I'll try to dust it > off and send out a version for the current kernel. So it was based on this interface? (I went looking for this code on google a couple of days ago so I could point at it and say "we should be using an iomap structure, not buffer heads", but it looks like I never posted it to fsdevel or the xfs lists...) diff --git a/include/linux/fs.h b/include/linux/fs.h index 090f0ea..e247d62 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -522,6 +522,7 @@ enum positive_aop_returns { struct page; struct address_space; struct writeback_control; +struct iomap; struct iov_iter { const struct iovec *iov; @@ -614,6 +615,9 @@ struct address_space_operations { int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); int (*error_remove_page)(struct address_space *, struct page *); + + int (*iomap)(struct address_space *mapping, loff_t pos, + ssize_t length, struct iomap *iomap, int cmd); }; /* diff --git a/include/linux/iomap.h b/include/linux/iomap.h new file mode 100644 index 000..7708614 --- /dev/null +++ b/include/linux/iomap.h @@ -0,0 +1,45 @@ +#ifndef _IOMAP_H +#define _IOMAP_H + +/* ->iomap a_op command types */ +#define IOMAP_READ 0x01/* read the current mapping starting at the + given position, trimmed to a maximum length. + FS's should use this to obtain and lock + resources within this range */ +#define IOMAP_RESERVE 0x02/* reserve space for an allocation that spans + the given iomap */ +#define IOMAP_ALLOCATE 0x03/* allocate space in a given iomap - must have + first been reserved */ +#define IOMAP_UNRESERVE0x04/* return unused reserved space for the given + iomap and used space. This will always be + called after a IOMAP_READ so as to allow the + FS to release held resources. */ + +/* types of block ranges for multipage write mappings. */ +#define IOMAP_HOLE 0x01/* no blocks allocated, need allocation */ +#define IOMAP_DELALLOC 0x02/* delayed allocation blocks */ +#define IOMAP_MAPPED 0x03/* blocks allocated @blkno */ +#define IOMAP_UNWRITTEN0x04/* blocks allocated @blkno in unwritten state */ + +struct iomap { + sector_tblkno; /* first sector of mapping */ + loff_t offset; /* file offset of mapping, bytes */ + ssize_t length; /* length of mapping, bytes */ + int type; /* type of mapping */ + void*priv; /* fs private data associated with map */ +}; + +static inline bool +iomap_needs_allocation(struct iomap *iomap) +{ + return iomap->type == IOMAP_HOLE; +} + +/* multipage write interfaces use iomaps */ +typedef int (*mpw_actor_t)(struct address_space *mapping, void *src, + loff_t pos, ssize_t len, struct iomap *iomap); + +ssize_t multipage_write_segment(struct address_space *mapping, void *src, + loff_t pos, ssize_t length, mpw_actor_t actor); + +#endif /* _IOMAP_H */ Cheers, Dave. > > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Fri, Nov 30, 2012 at 6:31 AM, Chris Mason wrote: > On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote: >> >> Same with mpage_readpages(), so it's not just direct IO that has >> this problem > > I guess the good news is that block devices don't have readpages. The > bad news would be that we can't put readpages in without much bigger > changes. Well, the new block-dev branch no longer cares. It basically says "ok, we use inode->i_blkbits, but for raw device accesses we know it's unstable, so we'll just not use it". So both mpage_readpages and direct-IO should be happy. And it actually removed redundant code, so it's all good. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Fri, Nov 30, 2012 at 01:49:10PM +1100, Dave Chinner wrote: > > Ugh. That's a big violation of how buffer-heads are supposed to work: > > the block number is very much defined to be in multiples of b_size > > (see for example "submit_bh()" that turns it into a sector number). > > > > But you're right. The direct-IO code really *is* violating that, and > > knows that get_block() ends up being defined in i_blkbits regardless > > of b_size. > > Same with mpage_readpages(), so it's not just direct IO that has > this problem The mpage code may actually fall back to BHs. I have a version of the direct I/O code that uses the iomap_ops from the multi-page write code that you originally started. It uses the new op as primary interface for direct I/O and provides a helper for filesystems that still use buffer heads internally. I'll try to dust it off and send out a version for the current kernel. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 07:49:10PM -0700, Dave Chinner wrote: > On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote: > > On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason > > wrote: > > > > > > Just reading the new blkdev_get_blocks, it looks like we're mixing > > > shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and > > > it has no relation at all to the actual block size of the device. The > > > interface is abusing b_size to ask for as large a mapping as possible. > > > > Ugh. That's a big violation of how buffer-heads are supposed to work: > > the block number is very much defined to be in multiples of b_size > > (see for example "submit_bh()" that turns it into a sector number). > > > > But you're right. The direct-IO code really *is* violating that, and > > knows that get_block() ends up being defined in i_blkbits regardless > > of b_size. > > Same with mpage_readpages(), so it's not just direct IO that has > this problem I guess the good news is that block devices don't have readpages. The bad news would be that we can't put readpages in without much bigger changes. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 02:16:50PM -0800, Linus Torvalds wrote: > On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason wrote: > > > > Just reading the new blkdev_get_blocks, it looks like we're mixing > > shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and > > it has no relation at all to the actual block size of the device. The > > interface is abusing b_size to ask for as large a mapping as possible. > > Ugh. That's a big violation of how buffer-heads are supposed to work: > the block number is very much defined to be in multiples of b_size > (see for example "submit_bh()" that turns it into a sector number). > > But you're right. The direct-IO code really *is* violating that, and > knows that get_block() ends up being defined in i_blkbits regardless > of b_size. Same with mpage_readpages(), so it's not just direct IO that has this problem Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 07:13:02PM -0700, Linus Torvalds wrote: > On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason wrote: > > > > I searched through filemap.c for the magic i_size check that would let > > us get away with ignoring i_blkbits in get_blocks, but its just not > > there. The whole fallback-to-buffered scheme seems to rely on > > get_blocks checking for i_size. I really hope I'm just missing > > something. > > So generic_write_checks() limits the size to i_size at for writes (and > for "isblk"). Great, that's what I was missing. > > Sure, then it will do the buffered part after that, but that should > all be fine anyway, since by then we use the normal page cache. > > For reads, generic_file_aio_read() will check pos < size, but doesn't > seem to actually limit the size of the iovec. I couldn't explain that either. > > I'm not sure why it doesn't just do "iov_shorten()". > > Anyway, having looked at actually passing in the block size to > get_block(), I can say that is a horrible idea. There are tons of > get_block functions (for various filesystems), and *none* of them > really want the block size, because they tend to work on block > indexes. And if they do want the block size, they'll just get it from > the inode or sb, since they are filesystems and it's all stable. > > So the *only* of the places that would want the block size is > fs/block_dev.c. And the callers really already seem to do the i_size > check, although they sometimes do it badly. And since there are fewer > callers than there are get_block() implementations, I think we should > just fix the callers and be done with it. > > Those generic_file_aio_read/write() functions in fs/direct-io.c really > just seem to be badly written. The fact that they may depend on the > i_size check in get_blocks() is sad, but I think we should fix it and > just remove the check for block devices. That's going to simplify so > much.. > > I updated the 'block-dev' branch to have that simpler fs/block_dev.c > model instead. I'll look at the iovec shortening later. It's a > non-fast-forward thing, look out! > > (I actually think we should just add the max-offset check to > rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing, > and we could make it do a max_offset check as well). This is definitely easier, and I can't see any reason not to do it. I'm used to get_block being expensive and so it didn't even cross my mind. We can benchmark things just to make sure. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 5:16 PM, Chris Mason wrote: > > I searched through filemap.c for the magic i_size check that would let > us get away with ignoring i_blkbits in get_blocks, but its just not > there. The whole fallback-to-buffered scheme seems to rely on > get_blocks checking for i_size. I really hope I'm just missing > something. So generic_write_checks() limits the size to i_size at for writes (and for "isblk"). Sure, then it will do the buffered part after that, but that should all be fine anyway, since by then we use the normal page cache. For reads, generic_file_aio_read() will check pos < size, but doesn't seem to actually limit the size of the iovec. I'm not sure why it doesn't just do "iov_shorten()". Anyway, having looked at actually passing in the block size to get_block(), I can say that is a horrible idea. There are tons of get_block functions (for various filesystems), and *none* of them really want the block size, because they tend to work on block indexes. And if they do want the block size, they'll just get it from the inode or sb, since they are filesystems and it's all stable. So the *only* of the places that would want the block size is fs/block_dev.c. And the callers really already seem to do the i_size check, although they sometimes do it badly. And since there are fewer callers than there are get_block() implementations, I think we should just fix the callers and be done with it. Those generic_file_aio_read/write() functions in fs/direct-io.c really just seem to be badly written. The fact that they may depend on the i_size check in get_blocks() is sad, but I think we should fix it and just remove the check for block devices. That's going to simplify so much.. I updated the 'block-dev' branch to have that simpler fs/block_dev.c model instead. I'll look at the iovec shortening later. It's a non-fast-forward thing, look out! (I actually think we should just add the max-offset check to rw_copy_check_uvector(). That one already does the MAX_RW_COUNT thing, and we could make it do a max_offset check as well). Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 03:36:38PM -0700, Linus Torvalds wrote: > On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds > wrote: > > > > But you're right. The direct-IO code really *is* violating that, and > > knows that get_block() ends up being defined in i_blkbits regardless > > of b_size. > > It turns out fs/ioctl.c does the same - it fills in the buffer head > with some random bh->b_size too. I think it's not even a power of two > in that case. > > And I guess it's understandable - they don't actually *use* the > buffer, they just want the offset. So the b_size field really is just > random crap to the users of the get_block interfaces, since they've > never cared before. > > Ugh, this was definitely a dark and disgusting underbelly of the VFS > layer. We've not had to really touch it for a *looong* time.. I searched through filemap.c for the magic i_size check that would let us get away with ignoring i_blkbits in get_blocks, but its just not there. The whole fallback-to-buffered scheme seems to rely on get_blocks checking for i_size. I really hope I'm just missing something. If we're going to change this, I'd vote for something non-bh based. I didn't check every single FS, but I don't think direct-IO really wants or needs buffer heads at all. One less wart in direct-io.c would really be nice, but I'm assuming it'll take us at least one full release to hammer out a shiny new get_blocks. Passing i_blkbits would be more mechanical, since all the filesystems would just ignore it. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 2:16 PM, Linus Torvalds wrote: > > But you're right. The direct-IO code really *is* violating that, and > knows that get_block() ends up being defined in i_blkbits regardless > of b_size. It turns out fs/ioctl.c does the same - it fills in the buffer head with some random bh->b_size too. I think it's not even a power of two in that case. And I guess it's understandable - they don't actually *use* the buffer, they just want the offset. So the b_size field really is just random crap to the users of the get_block interfaces, since they've never cared before. Ugh, this was definitely a dark and disgusting underbelly of the VFS layer. We've not had to really touch it for a *looong* time.. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 1:29 PM, Chris Mason wrote: > > Just reading the new blkdev_get_blocks, it looks like we're mixing > shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and > it has no relation at all to the actual block size of the device. The > interface is abusing b_size to ask for as large a mapping as possible. Ugh. That's a big violation of how buffer-heads are supposed to work: the block number is very much defined to be in multiples of b_size (see for example "submit_bh()" that turns it into a sector number). But you're right. The direct-IO code really *is* violating that, and knows that get_block() ends up being defined in i_blkbits regardless of b_size. What a crock. That direct-IO code is hack-upon-hack. Whoever wrote it should be shot. I think the only sane way to fix is is to pass in the block size to get_blocks(). Which we admittedly should have done long ago, so that's not a bad fix, but without actually looking at what it involves, I think it's going to be pretty big patch. All the filesystems that support the interface need to update it, even if they can then ignore it, because direct-IO does all these hacks only for the raw device. And I think it will improve the interface, but damn, direct-IO is still horrible for playing these kinds of games. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 01:52:22PM -0700, Linus Torvalds wrote: > On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason > wrote: > > > > It was all a trick to get you to say the AIO code was sane. > > It's only sane compared to the DIO code. > > That said, I hate AIO much less these days that we've largely merged > the code with the regular IO. It's still a horrible interface, but at > least it is no longer a really disgusting separate implementation in > the kernel of that horrible interface. > > So yeah, I guess AIO really is pretty sane these days. > > > It looks like we could use the private copy of i_blkbits that DIO is > > already recording. > > Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out. > > I've pushed out two more commits to the 'block-dev' branch at > > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev > > in case anybody wants to take a look. > > It is - as usual - entirely untested. It compiles, and I *think* that > blkdev_get_blocks() makes a whole lot more sense this way - as you > said, it should be byte-based (although it actually does the block > number conversion because I worried about overflow - probably > unnecessarily). > > Comments? Your blkdev_get_blocks emails were great reading while at the dentist, thanks for helping me pass the time. Just reading the new blkdev_get_blocks, it looks like we're mixing shifts. In direct-io.c map_bh->b_size is how much we'd like to map, and it has no relation at all to the actual block size of the device. The interface is abusing b_size to ask for as large a mapping as possible. Most importantly, it has no relation to the fs_startblk that we pass in, which is based on inode->i_blkbits. So your new check in blkdev_get_blocks: if (iblock >= end_block) { Is wrong because iblock and end_block are based on different sizes. I think we have to do the eof checks inside fs/direct-io.c or change the get_blocks interface completely. I really thought fs/direct-io.c was already doing eof checks, but I'm reading harder. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason wrote: > > It was all a trick to get you to say the AIO code was sane. It's only sane compared to the DIO code. That said, I hate AIO much less these days that we've largely merged the code with the regular IO. It's still a horrible interface, but at least it is no longer a really disgusting separate implementation in the kernel of that horrible interface. So yeah, I guess AIO really is pretty sane these days. > It looks like we could use the private copy of i_blkbits that DIO is > already recording. Yes. But that didn't fix the blkdev_get_blocks() mess you pointed out. I've pushed out two more commits to the 'block-dev' branch at git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux block-dev in case anybody wants to take a look. It is - as usual - entirely untested. It compiles, and I *think* that blkdev_get_blocks() makes a whole lot more sense this way - as you said, it should be byte-based (although it actually does the block number conversion because I worried about overflow - probably unnecessarily). Comments? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 11:55 AM, Linus Torvalds wrote: > > The blkdev_get_blocks() this is just sad. > > The underlying data structure is actually byte-based (it's > "i_size_read(bdev->bd_inode"), but we convert it to a block-based > number. > > Oops. Oh, it's even worse than that. The DIO code ends up passing in buffer heads that have sizes bigger than the inode i_blksize, which can cause problems at the end of the disk. So blkdev_get_blocks() knows about it, and will then "fix" that and shrink them down. The games with "max_block" are hilarious. In a really sad way. That whole blkdev_get_blocks() function is pure and utter shit. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 11:48 AM, Chris Mason wrote: > > blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I > really don't get why that isn't byte based instead. DIO is already > doing the shift & mask game. The blkdev_get_blocks() this is just sad. The underlying data structure is actually byte-based (it's "i_size_read(bdev->bd_inode"), but we convert it to a block-based number. Oops. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 11:26 AM, Linus Torvalds wrote: > > (I now realize that Mikulas was talking about this mess, while I > thought he was talking about the AIO code which is largely sane). Oh wow. The direct-IO code really doesn't seem to care at all. I don't think it needs locking either (it seems to do everything with a private buffer-head), and the problem appears solely to be that it reads i_blksize multiple times, so changing it just happens to confuse the direct-io code. If it were to read it only once, and then use that value, it looks like it should all JustWork(tm). And the right thing to do would seem to just add it to the "dio_submit" structure, that we already have. And it already *has* a blkbits field, but that's the "IO blocksize", not the "getblocks blocksize", if I read that mess correctly. Of course, it then *ALREADY* has that "blkfactor" thing, which is the difference between i_blkbits and blktbits, so it effective *does* have i_blkbits already in the dio_submit structure. But despite it all, it keeps re-reading i_blksize. Christ. That code is a mess. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 12:26:06PM -0700, Linus Torvalds wrote: > On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason > wrote: > > > > The fs/buffer.c part makes sense during a quick read. But > > fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real > > bugs there. > > Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be > incestuously playing games that should be in fs/buffer.c. I thought it > was doing the sane thing with the page cache. > > (I now realize that Mikulas was talking about this mess, while I > thought he was talking about the AIO code which is largely sane). It was all a trick to get you to say the AIO code was sane. It looks like we could use the private copy of i_blkbits that DIO is already recording. blkdev_get_blocks (called during DIO) is also checking i_blkbits, but I really don't get why that isn't byte based instead. DIO is already doing the shift & mask game. I think only clean_blockdev_aliases is intentionally using the inode's i_blkbits, but again that shouldn't be changing for filesystems so it seems safe to use the DIO copy. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 11:15 AM, Chris Mason wrote: > > The fs/buffer.c part makes sense during a quick read. But > fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real > bugs there. Ugh. I _hate_ direct-IO. What a mess. And yeah, it seems to be incestuously playing games that should be in fs/buffer.c. I thought it was doing the sane thing with the page cache. (I now realize that Mikulas was talking about this mess, while I thought he was talking about the AIO code which is largely sane). Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 12:02:17PM -0700, Linus Torvalds wrote: > On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds > wrote: > > > > I think I'll apply this for 3.7 (since it's too late to do anything > > fancier), and then for 3.8 I will rip out all the locking entirely, > > because looking at the fs/buffer.c patch I wrote up, it's all totally > > unnecessary. > > > > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when > > creating new buffers) simply makes the whole locking thing pointless. > > Just make the page lock protect the block size, and make it per-page, > > and we're done. > > There's a 'block-dev' branch in my git tree, if you guys want to play > around with it. > > It actually reverts fs/block-dev.c back to the 3.6 state (except for > some whitespace damage that I refused to re-introduce), so that part > of the changes should be pretty safe and well tested. > > The fs/buffer.c changes, of course, are new. It's largely the same > patch I already sent out, with a small helper function to simplify it, > and to keep the whole ACCESS_ONCE() thing in just a single place. The fs/buffer.c part makes sense during a quick read. But fs/direct-io.c plays with i_blkbits too. The semaphore was fixing real bugs there. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 9:19 AM, Linus Torvalds wrote: > > I think I'll apply this for 3.7 (since it's too late to do anything > fancier), and then for 3.8 I will rip out all the locking entirely, > because looking at the fs/buffer.c patch I wrote up, it's all totally > unnecessary. > > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when > creating new buffers) simply makes the whole locking thing pointless. > Just make the page lock protect the block size, and make it per-page, > and we're done. There's a 'block-dev' branch in my git tree, if you guys want to play around with it. It actually reverts fs/block-dev.c back to the 3.6 state (except for some whitespace damage that I refused to re-introduce), so that part of the changes should be pretty safe and well tested. The fs/buffer.c changes, of course, are new. It's largely the same patch I already sent out, with a small helper function to simplify it, and to keep the whole ACCESS_ONCE() thing in just a single place. That branch may be re-based in case I get reports or acks or whatever, so don't rely on it (or if you do, please let me know, and I'll stop editing it). The fact that I could just revert the fs/block-dev.c part to a known state makes me wonder if this might be safe for 3.7 after all (the fs/buffer.c changes all *look* safe). That way we'd not have to worry about any new semantics (whether they be EBUSY or any possible locking slowdowns or RT issues). But I'll think about it, and it would be good for people to double-check my fs/buffer.c stuff. Mikulas, does that pass your testing? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, Nov 29, 2012 at 10:23 AM, Mikulas Patocka wrote: > > > If you remove that percpu rw lock, you also need to rewrite direct i/o > code. > > In theory, block device direct i/o doesn't need buffer block size at all. > But in practice, it shares a lot of code with filesystem direct i/o, it > reads the block size multiple times and it crashes if it changes. If it's a filesystem, then the size will never change while it is mounted. So only the direct-block-device case needs to be worried about, no? And that uses __generic_file_aio_write() and friends, which in turn use the readpage/writepage functions. So for block devices, it should be sufficient to make readpage/writepage (with the writing obviously having all the "write_begin/write_end/fullpage" variants) be safe as far as I can see. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Thu, 29 Nov 2012, Linus Torvalds wrote: > On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka wrote: > > > > This sounds sensible. I'm sending this patch. > > This looks much better. > > I think I'll apply this for 3.7 (since it's too late to do anything > fancier), and then for 3.8 I will rip out all the locking entirely, > because looking at the fs/buffer.c patch I wrote up, it's all totally > unnecessary. > > Adding a ACCESS_ONCE() to the read of the i_blkbits value (when > creating new buffers) simply makes the whole locking thing pointless. > Just make the page lock protect the block size, and make it per-page, > and we're done. > > No RCU grace period crap, no expedited mess, no nothing. > > Linus Yes. If you remove that percpu rw lock, you also need to rewrite direct i/o code. In theory, block device direct i/o doesn't need buffer block size at all. But in practice, it shares a lot of code with filesystem direct i/o, it reads the block size multiple times and it crashes if it changes. Mikulas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Do a proper locking for mmap and block size change
On Wed, Nov 28, 2012 at 2:01 PM, Mikulas Patocka wrote: > > This sounds sensible. I'm sending this patch. This looks much better. I think I'll apply this for 3.7 (since it's too late to do anything fancier), and then for 3.8 I will rip out all the locking entirely, because looking at the fs/buffer.c patch I wrote up, it's all totally unnecessary. Adding a ACCESS_ONCE() to the read of the i_blkbits value (when creating new buffers) simply makes the whole locking thing pointless. Just make the page lock protect the block size, and make it per-page, and we're done. No RCU grace period crap, no expedited mess, no nothing. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] Do a proper locking for mmap and block size change
On Wed, 28 Nov 2012, Al Viro wrote: > On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote: > > No, this is crap. > > > > We don't introduce random hooks like this just because the block layer > > has shit-for-brains and cannot be bothered to do things right. > > > > The fact is, the whole locking in the block layer open routine is > > total and utter crap. It doesn't lock the right thing, even with your > > change *anyway* (or with the change Jens had). Absolutely nothing in > > "mmap_region()" cares at all about the block-size anywhere - it's > > generic, after all - so locking around it is f*cking pointless. There > > is no way in hell that the caller of ->mmap can *ever* care about the > > block size, since it never even looks at it. > > > > Don't do random crap like this. > > > > Why does the code think that mmap matters so much anyway? As you say, > > the mmap itself does *nothing*. It has no impact for the block size. > > ... and here I was, trying to massage a reply into form that would be > at least borderline printable... Anyway, this is certainly the wrong > way to go. We want to catch if the damn thing is mapped and mapping_mapped() > leaves a race? Fine, so let's not use mapping_mapped() at all. Have a > private vm_operations - a copy of generic_file_vm_ops with ->open()/->close() > added to it. Incrementing/decrementing a per-block_device atomic counter, > with increment side done under your rwsem, to make sure that 0->positive > transition doesn't change in critical section of set_blocksize(). And don't > use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that > local copy. This sounds sensible. I'm sending this patch. Mikulas --- Do a proper locking for mmap and block size change For block devices, we must prevent the device from being mapped while block size is being changed. This was implemented by catching the mmap method and taking bd_block_size_semaphore in it. The problem is that the generic_file_mmap method does almost nothing, it only sets vma->vm_ops = &generic_file_vm_ops, all the hard work is done by the caller, mmap_region. Consequently, locking the mmap method is insufficient, to prevent the race condition. The race condition can happen for example this way: process 1: Starts mapping a block device, it goes to mmap_region, calls file->f_op->mmap (blkdev_mmap - that acquires and releases bd_block_size_semaphore), and reschedule happens just after blkdev_mmap returns. process 2: Changes block device size, goes into set_blocksize, acquires percpu_down_write, calls mapping_mapped to test if the device is mapped (it still isn't). Then, it reschedules. process 1: Continues in mmap_region, successfully finishes the mapping. process 2: Continues in set_blocksize, changes the block size while the block device is mapped. (which is incorrect and may result in a crash or misbehavior). To fix this possible race condition, we introduce a counter bd_mmap_count that counts the number of vmas that maps the block device. bd_mmap_count is incremented in blkdev_mmap and in blkdev_vm_open, it is decremented in blkdev_vm_close. We don't allow changing block size while bd_mmap_count is non-zero. Signed-off-by: Mikulas Patocka --- fs/block_dev.c | 49 - include/linux/fs.h |3 +++ 2 files changed, 35 insertions(+), 17 deletions(-) Index: linux-3.7-rc7/fs/block_dev.c === --- linux-3.7-rc7.orig/fs/block_dev.c 2012-11-28 21:13:12.0 +0100 +++ linux-3.7-rc7/fs/block_dev.c2012-11-28 21:33:23.0 +0100 @@ -35,6 +35,7 @@ struct bdev_inode { struct inode vfs_inode; }; +static const struct vm_operations_struct def_blk_vm_ops; static const struct address_space_operations def_blk_aops; static inline struct bdev_inode *BDEV_I(struct inode *inode) @@ -114,16 +115,6 @@ void invalidate_bdev(struct block_device } EXPORT_SYMBOL(invalidate_bdev); -static int is_bdev_mapped(struct block_device *bdev) -{ - int ret_val; - struct address_space *mapping = bdev->bd_inode->i_mapping; - mutex_lock(&mapping->i_mmap_mutex); - ret_val = mapping_mapped(mapping); - mutex_unlock(&mapping->i_mmap_mutex); - return ret_val; -} - int set_blocksize(struct block_device *bdev, int size) { /* Size must be a power of two, and between 512 and PAGE_SIZE */ @@ -136,16 +127,16 @@ int set_blocksize(struct block_device *b /* * If the block size doesn't change, don't take the write lock. -* We check for is_bdev_mapped anyway, for consistent behavior. +* We check for bd_mmap_count anyway, for consistent behavior. */ if (size == bdev->bd_block_size) - return is_bdev_mapped(bdev) ? -EBUSY : 0; + return atomic_read(&bdev->bd_mmap_count) ? -EBUSY : 0; /* Prevent starting I/O or mapping the device */