Re: [PATCH] block devices: validate block device capacity

2014-02-03 Thread Christoph Hellwig
On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
> send a patch that does it.
>
> Until you do, you should apply the patch I sent, which prevents
> kernel lockups or data corruption when the user uses a 16TiB device on
> a 32-bit kernel.

Exactly.  I had actually looked into support for >16TiB devices for
a NAS use case a while ago, but when explaining the effort involved,
the idea was dropped quickly.  The Linux block device is too deeply
tied to the pagecache to make it easily feasible.



Re: [PATCH] block devices: validate block device capacity

2014-02-03 Thread Mikulas Patocka


On Mon, 3 Feb 2014, Christoph Hellwig wrote:

> On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> > So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
> > send a patch that does it.
> >
> > Until you do, you should apply the patch I sent, which prevents
> > kernel lockups or data corruption when the user uses a 16TiB device on
> > a 32-bit kernel.
>
> Exactly.  I had actually looked into support for >16TiB devices for
> a NAS use case a while ago, but when explaining the effort involved,
> the idea was dropped quickly.  The Linux block device is too deeply
> tied to the pagecache to make it easily feasible.

The memory management routines use pgoff_t, so we could define pgoff_t to
be a 64-bit type. But lib/radix-tree.c uses unsigned long as an index into
the radix tree - and pgoff_t is cast to unsigned long when calling the
radix_tree routines - so we'd need to change lib/radix-tree.c to use
pgoff_t as well.

Then, there may be other places where pgoff_t is cast to unsigned long,
and they are not trivial to find (one could enable extra compiler warnings
about truncating values when casting them, but I suppose this would
trigger a lot of false positives). This needs a deep review by the people
who designed the memory management code.
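
For illustration, the silent truncation in question looks like this (a
hypothetical userspace sketch, not kernel code; the only detail taken from
the source is that the radix_tree routines index by unsigned long):

	#include <stdint.h>
	#include <stdio.h>

	typedef uint64_t pgoff_t;	/* hypothetically widened pgoff_t */

	/* stand-in for lib/radix-tree.c, which indexes by unsigned long */
	static void radix_tree_lookup_stub(unsigned long index)
	{
		printf("looking up page index %lu\n", index);
	}

	int main(void)
	{
		pgoff_t offset = ((pgoff_t)5 << 32) | 123;	/* index above 2^32 */

		/*
		 * Where unsigned long is 32 bits wide, the implicit
		 * conversion below silently drops the high bits and page
		 * 123 is looked up instead.  gcc's -Wconversion would
		 * flag this line - but it also flags countless harmless
		 * conversions, hence the false positives mentioned above.
		 */
		radix_tree_lookup_stub(offset);
		return 0;
	}

On a 32-bit build this prints "looking up page index 123".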

Mikulas


Re: [PATCH] block devices: validate block device capacity

2014-01-31 Thread Mikulas Patocka


On Thu, 30 Jan 2014, James Bottomley wrote:

> > So, if you want 64-bit page offsets, you need to increase pgoff_t size,
> > and that will increase the limit for both files and block devices.
>
> No.  The point is the page cache mapping of the device uses a
> manufactured inode saved in the backing device. It looks fixable in the
> buffer code before the page cache gets involved.

So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
send a patch that does it.

Until you do, you should apply the patch I sent, which prevents kernel
lockups or data corruption when the user uses a 16TiB device on a 32-bit
kernel.

Mikulas


[PATCH] block devices: validate block device capacity

2014-01-30 Thread Mikulas Patocka
When running the LVM2 testsuite on a 32-bit kernel, there are unkillable
processes stuck in the kernel, consuming 100% CPU:
blkid   R running  0  2005   1409 0x0004
ce009d00 0082 ffcf c11280ba 0060 560b5dfd 3111 00fe41cb
 ce009d00  d51cfeb0  001e 0002 
0002 c10748c1 0002 c106cca4    
Call Trace:
[<c11280ba>] ? radix_tree_next_chunk+0xda/0x2c0
[<c10748c1>] ? release_pages+0x61/0x160
[<c106cca4>] ? find_get_pages+0x84/0x100
[<c1251fbe>] ? _cond_resched+0x1e/0x40
[<c10758cb>] ? truncate_inode_pages_range+0x12b/0x440
[<c1075cb7>] ? truncate_inode_pages+0x17/0x20
[<c10cf2ba>] ? __blkdev_put+0x3a/0x140
[<c10d02db>] ? blkdev_close+0x1b/0x40
[<c10a60b2>] ? __fput+0x72/0x1c0
[<c1039461>] ? task_work_run+0x61/0xa0
[<c1253b6f>] ? work_notifysig+0x24/0x35

This is caused by the fact that the LVM2 testsuite creates a 64TB device.
The kernel uses unsigned long to index pages in files and block devices;
on a 64TB device the unsigned long overflows (it can address at most 16TB
with 4k pages), causing the infinite loop.
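
(With 4k pages, a 32-bit page index covers at most 2^32 * 4096 bytes =
16TiB; a 64TB device needs about 2^34 page indexes, so the index wraps
around, which is how the truncate loop in the trace above can spin
forever.)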

On 32-bit architectures, we must limit block device size to
PAGE_SIZE*(2^32-1).

The bug with unchecked device size is pervasive across the whole kernel:
some drivers test that the device size fits in sector_t, but this test is
not sufficient on 32-bit architectures. This patch introduces a new
function, validate_disk_capacity, that tests if the disk capacity is OK
for the current kernel, and modifies the drivers brd, ide-gd, dm and sd to
use it.

Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 block/genhd.c |   23 +++
 drivers/block/brd.c   |   15 +++
 drivers/ide/ide-gd.c  |8 
 drivers/md/dm-ioctl.c |3 +--
 drivers/md/dm-table.c |   14 +-
 drivers/scsi/sd.c |   20 +++-
 include/linux/device-mapper.h |2 +-
 include/linux/genhd.h |2 ++
 8 files changed, 70 insertions(+), 17 deletions(-)

Index: linux-2.6-compile/block/genhd.c
===
--- linux-2.6-compile.orig/block/genhd.c	2014-01-30 17:23:15.0 +0100
+++ linux-2.6-compile/block/genhd.c	2014-01-30 19:28:42.0 +0100
@@ -1835,3 +1835,26 @@ static void disk_release_events(struct g
 	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
 	kfree(disk->ev);
 }
+
+int validate_disk_capacity(u64 n_sectors, const char **reason)
+{
+	u64 n_pages;
+	if (n_sectors << 9 >> 9 != n_sectors) {
+		if (reason)
+			*reason = "The number of bytes is greater than 2^64.";
+		return -EOVERFLOW;
+	}
+	n_pages = (n_sectors + (1 << (PAGE_SHIFT - 9)) - 1) >> (PAGE_SHIFT - 9);
+	if (n_pages > ULONG_MAX) {
+		if (reason)
+			*reason = "Use 64-bit kernel.";
+		return -EFBIG;
+	}
+	if (n_sectors != (sector_t)n_sectors) {
+		if (reason)
+			*reason = "Use a kernel compiled with support for large block devices.";
+		return -ENOSPC;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(validate_disk_capacity);
Index: linux-2.6-compile/drivers/block/brd.c
===
--- linux-2.6-compile.orig/drivers/block/brd.c	2014-01-30 17:23:15.0 +0100
+++ linux-2.6-compile/drivers/block/brd.c	2014-01-30 19:26:51.0 +0100
@@ -429,12 +429,12 @@ static const struct block_device_operati
  * And now the modules code and kernel interface.
  */
 static int rd_nr;
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
+static unsigned rd_size = CONFIG_BLK_DEV_RAM_SIZE;
 static int max_part;
 static int part_shift;
 module_param(rd_nr, int, S_IRUGO);
 MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
-module_param(rd_size, int, S_IRUGO);
+module_param(rd_size, uint, S_IRUGO);
 MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
 module_param(max_part, int, S_IRUGO);
 MODULE_PARM_DESC(max_part, "Maximum number of partitions per RAM disk");
@@ -446,7 +446,7 @@ MODULE_ALIAS("rd");
 /* Legacy boot options - nonmodular */
 static int __init ramdisk_size(char *str)
 {
-   rd_size = simple_strtol(str, NULL, 0);
+   rd_size = simple_strtoul(str, NULL, 0);
return 1;
 }
 __setup("ramdisk_size=", ramdisk_size);
@@ -463,6 +463,13 @@ static struct brd_device *brd_alloc(int
 {
struct brd_device *brd;
struct gendisk *disk;
+	u64 capacity = (u64)rd_size * 2;
+	const char *reason;
+
+	if (validate_disk_capacity(capacity, &reason)) {
+		printk(KERN_ERR "brd: disk is too big: %s\n", reason);
+		goto out;
+	}
 
brd = kzalloc(sizeof(*brd), GFP_KERNEL);
if (!brd)
@@ -493,7 +500,7 @@ static struct brd_device *brd_alloc(int
 	disk->queue = brd->brd_queue;
 	disk->flags |= 

Re: [PATCH] block devices: validate block device capacity

2014-01-30 Thread James Bottomley
On Thu, 2014-01-30 at 15:40 -0500, Mikulas Patocka wrote:
> When running the LVM2 testsuite on a 32-bit kernel, there are unkillable
> processes stuck in the kernel, consuming 100% CPU:
> blkid   R running  0  2005   1409 0x0004
> ce009d00 0082 ffcf c11280ba 0060 560b5dfd 3111 00fe41cb
>  ce009d00  d51cfeb0  001e 0002 
> 0002 c10748c1 0002 c106cca4    
> Call Trace:
> [<c11280ba>] ? radix_tree_next_chunk+0xda/0x2c0
> [<c10748c1>] ? release_pages+0x61/0x160
> [<c106cca4>] ? find_get_pages+0x84/0x100
> [<c1251fbe>] ? _cond_resched+0x1e/0x40
> [<c10758cb>] ? truncate_inode_pages_range+0x12b/0x440
> [<c1075cb7>] ? truncate_inode_pages+0x17/0x20
> [<c10cf2ba>] ? __blkdev_put+0x3a/0x140
> [<c10d02db>] ? blkdev_close+0x1b/0x40
> [<c10a60b2>] ? __fput+0x72/0x1c0
> [<c1039461>] ? task_work_run+0x61/0xa0
> [<c1253b6f>] ? work_notifysig+0x24/0x35
>
> This is caused by the fact that the LVM2 testsuite creates a 64TB device.
> The kernel uses unsigned long to index pages in files and block devices;
> on a 64TB device the unsigned long overflows (it can address at most 16TB
> with 4k pages), causing the infinite loop.

Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
allow 64 bit offsets for block devices on 32 bit.  It sounds like
there's somewhere not using sector_t ... or using it wrongly, which needs
fixing.

> On 32-bit architectures, we must limit block device size to
> PAGE_SIZE*(2^32-1).

So you're saying CONFIG_LBDAF can never work, why?

James





Re: [PATCH] block devices: validate block device capacity

2014-01-30 Thread Mikulas Patocka


On Thu, 30 Jan 2014, James Bottomley wrote:

> Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
> allow 64 bit offsets for block devices on 32 bit.  It sounds like
> there's somewhere not using sector_t ... or using it wrongly, which needs
> fixing.

The page cache uses unsigned long as a page index. Therefore, if unsigned 
long is 32-bit, the block device may have at most 2^32-1 pages.

> > On 32-bit architectures, we must limit block device size to
> > PAGE_SIZE*(2^32-1).
>
> So you're saying CONFIG_LBDAF can never work, why?
>
> James

CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86, 
without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is 
16TiB (4096*2^32).
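
(Spelling the arithmetic out: without CONFIG_LBDAF, sector_t is a 32-bit
unsigned long, giving 2^32 sectors * 512 bytes = 2TiB; with CONFIG_LBDAF,
sector_t becomes u64, but the page index is still a 32-bit unsigned long,
giving 2^32 pages * 4096 bytes = 16TiB.)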

Mikulas


Re: [PATCH] block devices: validate block device capacity

2014-01-30 Thread James Bottomley
On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
>
> On Thu, 30 Jan 2014, James Bottomley wrote:
>
> > Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
> > allow 64 bit offsets for block devices on 32 bit.  It sounds like
> > there's somewhere not using sector_t ... or using it wrongly, which needs
> > fixing.
>
> The page cache uses unsigned long as a page index. Therefore, if unsigned
> long is 32-bit, the block device may have at most 2^32-1 pages.

Um, that's the index into the mapping, not the device; a device can have
multiple mappings, and each mapping has a radix tree of pages.  For most
filesystems a mapping is equivalent to a file, so we can have large
filesystems, but they can't actually have files over 4GB on 32 bits,
otherwise mmap fails.

Are we running into a problem with struct address_space, where we've
assumed the inode belongs to the file, and lvm is doing something where
it's the whole device?

> > > On 32-bit architectures, we must limit block device size to
> > > PAGE_SIZE*(2^32-1).
> >
> > So you're saying CONFIG_LBDAF can never work, why?
> >
> > James
>
> CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86,
> without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is
> 16TiB (4096*2^32).

I don't think the people who did the large block device work expected to
gain only 3 bits for all their pain.

James





Re: [PATCH] block devices: validate block device capacity

2014-01-30 Thread Mikulas Patocka


On Thu, 30 Jan 2014, James Bottomley wrote:

> On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> >
> > On Thu, 30 Jan 2014, James Bottomley wrote:
> >
> > > Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
> > > allow 64 bit offsets for block devices on 32 bit.  It sounds like
> > > there's somewhere not using sector_t ... or using it wrongly, which needs
> > > fixing.
> >
> > The page cache uses unsigned long as a page index. Therefore, if unsigned
> > long is 32-bit, the block device may have at most 2^32-1 pages.
>
> Um, that's the index into the mapping, not the device; a device can have
> multiple mappings, and each mapping has a radix tree of pages.  For most
> filesystems a mapping is equivalent to a file, so we can have large
> filesystems, but they can't actually have files over 4GB on 32 bits,
> otherwise mmap fails.

A device may be accessed directly (by opening /dev/sdX) and it creates a
mapping too - thus, the size of a mapping limits the size of a block
device.

The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may
fix it - but there may be some hidden places where pgoff is converted to
unsigned long - who knows if they exist at all?

> Are we running into a problem with struct address_space, where we've
> assumed the inode belongs to the file, and lvm is doing something where
> it's the whole device?

lvm creates a 64TiB device, udev runs blkid on that device, and blkid
opens the device and gets stuck because of the unsigned long overflow.

> > > > On 32-bit architectures, we must limit block device size to
> > > > PAGE_SIZE*(2^32-1).
> > >
> > > So you're saying CONFIG_LBDAF can never work, why?
> > >
> > > James
> >
> > CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86,
> > without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is
> > 16TiB (4096*2^32).
>
> I don't think the people who did the large block device work expected to
> gain only 3 bits for all their pain.
>
> James

One could change it to have three choices:
2TiB limit - 32-bit sector_t and 32-bit pgoff_t
16TiB limit - 64-bit sector_t and 32-bit pgoff_t
32PiB limit - 64-bit sector_t and 64-bit pgoff_t

Though, we need to know if the people who designed memory management agree 
with changing pgoff_t to 64 bits.

Mikulas


Re: [PATCH] block devices: validate block device capacity

2014-01-30 Thread James Bottomley
On Thu, 2014-01-30 at 19:20 -0500, Mikulas Patocka wrote:
>
> On Thu, 30 Jan 2014, James Bottomley wrote:
>
> > On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> > >
> > > On Thu, 30 Jan 2014, James Bottomley wrote:
> > >
> > > > Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
> > > > allow 64 bit offsets for block devices on 32 bit.  It sounds like
> > > > there's somewhere not using sector_t ... or using it wrongly, which needs
> > > > fixing.
> > >
> > > The page cache uses unsigned long as a page index. Therefore, if unsigned
> > > long is 32-bit, the block device may have at most 2^32-1 pages.
> >
> > Um, that's the index into the mapping, not the device; a device can have
> > multiple mappings, and each mapping has a radix tree of pages.  For most
> > filesystems a mapping is equivalent to a file, so we can have large
> > filesystems, but they can't actually have files over 4GB on 32 bits,
> > otherwise mmap fails.
>
> A device may be accessed directly (by opening /dev/sdX) and it creates a
> mapping too - thus, the size of a mapping limits the size of a block
> device.

Right, that's what I suspected below.  We can't damage large block
support on filesystems just because of this corner case.

> The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may
> fix it - but there may be some hidden places where pgoff is converted to
> unsigned long - who knows if they exist at all?

I don't think we want to do that ... it will make struct page fatter and
have knock-on impacts in the radix tree code.  To fix this, we need to
make the corner case (i.e. opening large block devices without a
filesystem) bear the pain.  It sort of looks like we want to do a linear
array of mappings of 64TB for the device so the page cache calculations
don't overflow.

> > Are we running into a problem with struct address_space, where we've
> > assumed the inode belongs to the file, and lvm is doing something where
> > it's the whole device?
>
> lvm creates a 64TiB device, udev runs blkid on that device, and blkid
> opens the device and gets stuck because of the unsigned long overflow.

Well, a simple open won't cause this ... it must be trying to read the
end of the device for some reason.  But anyway, the way to fix this is
to fix the large block open as a corner case.

> > > > > On 32-bit architectures, we must limit block device size to
> > > > > PAGE_SIZE*(2^32-1).
> > > >
> > > > So you're saying CONFIG_LBDAF can never work, why?
> > > >
> > > > James
> > >
> > > CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86,
> > > without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is
> > > 16TiB (4096*2^32).
> >
> > I don't think the people who did the large block device work expected to
> > gain only 3 bits for all their pain.
> >
> > James
>
> One could change it to have three choices:
> 2TiB limit - 32-bit sector_t and 32-bit pgoff_t
> 16TiB limit - 64-bit sector_t and 32-bit pgoff_t
> 32PiB limit - 64-bit sector_t and 64-bit pgoff_t
>
> Though, we need to know if the people who designed memory management agree
> with changing pgoff_t to 64 bits.

I don't think we can change the size of pgoff_t ... because it won't
just be that, there will be other problems like the radix tree.

However, you also have to bear in mind that truncating large block
device support to 64TB on 32 bits is a technical ABI break.  Hopefully
it is only technical, because I don't know of any current consumer block
device that is 64TB yet, but anyone who'd created a filesystem >64TB
would find it no longer mounted on 32 bits.

James



Re: [PATCH] block devices: validate block device capacity

2014-01-30 Thread Mikulas Patocka


On Thu, 30 Jan 2014, James Bottomley wrote:

> > A device may be accessed directly (by opening /dev/sdX) and it creates a
> > mapping too - thus, the size of a mapping limits the size of a block
> > device.
>
> Right, that's what I suspected below.  We can't damage large block
> support on filesystems just because of this corner case.

Devices larger than 16TiB never worked on a 32-bit kernel, so this patch
isn't damaging anything.

Note that even if you attach a >16TiB block device and just mount it,
without opening it directly, it still won't work, because the buffer cache
uses the page cache (see the function __find_get_block_slow and its
variable pgoff_t index - that variable would overflow if the filesystem
accessed a buffer beyond 16TiB).
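
For reference, the computation in question looks roughly like this
(paraphrased from memory from fs/buffer.c of that era - a sketch, not a
verbatim copy):

	/* in __find_get_block_slow(): find the page cache page that
	 * holds the buffer for a given block number */
	pgoff_t index;	/* pgoff_t is unsigned long - 32 bits here */

	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
	page = find_get_page(bd_mapping, index);

With 4KiB pages and 512-byte blocks, any block past 2^35 (a byte offset of
16TiB) wraps that 32-bit index, and the lookup lands on the wrong page.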

> > The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes may
> > fix it - but there may be some hidden places where pgoff is converted to
> > unsigned long - who knows if they exist at all?
>
> I don't think we want to do that ... it will make struct page fatter and
> have knock-on impacts in the radix tree code.  To fix this, we need to
> make the corner case (i.e. opening large block devices without a
> filesystem) bear the pain.  It sort of looks like we want to do a linear
> array of mappings of 64TB for the device so the page cache calculations
> don't overflow.

The code that reads and writes data to block devices and files is shared - 
the functions in mm/filemap.c work for both files and block devices.

So, if you want 64-bit page offsets, you need to increase pgoff_t size, 
and that will increase the limit for both files and block devices.

You shouldn't have separate functions for managing pages on files and 
separate functions for managing pages on block devices - that would 
increase code size and cause maintenance problems.
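
For instance (paraphrased from memory from fs/block_dev.c - a sketch, not
verbatim), reads on a block device node funnel into the same generic
page-cache path that regular files use:

	/* def_blk_fops wires /dev/sdX reads into the generic page cache
	 * code, operating on bdev->bd_inode->i_mapping - the same
	 * functions in mm/filemap.c that serve regular files */
	static ssize_t blkdev_aio_read(struct kiocb *iocb,
			const struct iovec *iov,
			unsigned long nr_segs, loff_t pos)
	{
		/* the real function also clamps the request against the
		 * device size before calling into mm/filemap.c */
		return generic_file_aio_read(iocb, iov, nr_segs, pos);
	}

so splitting block devices away from mm/filemap.c would mean duplicating
that whole path.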

> > Though, we need to know if the people who designed memory management agree
> > with changing pgoff_t to 64 bits.
>
> I don't think we can change the size of pgoff_t ... because it won't
> just be that, there will be other problems like the radix tree.

If we can't change it, then we must stay with the current 16TiB limit. 
There's no other way.

> However, you also have to bear in mind that truncating large block
> device support to 64TB on 32 bits is a technical ABI break.  Hopefully
> it is only technical, because I don't know of any current consumer block
> device that is 64TB yet, but anyone who'd created a filesystem >64TB
> would find it no longer mounted on 32 bits.
> James

It is not an ABI break, because block devices larger than 16TiB never
worked on 32-bit architectures. So it's better to refuse them outright
than to cause subtle lockups or data corruption.

Mikulas