Re: [PATCH] block devices: validate block device capacity
On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
> send a patch that does it. Until you make it, you should apply the patch
> that I sent, which prevents kernel lockups or data corruption when the
> user uses a 16TiB device on a 32-bit kernel.

Exactly.  I had actually looked into support for 16TiB devices for a NAS
use case a while ago, but when explaining the effort involved, the idea
was dropped quickly.  The Linux block device code is too deeply tied to
the pagecache to make it easily feasible.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] block devices: validate block device capacity
On Mon, 3 Feb 2014, Christoph Hellwig wrote:
> On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> > So if you think you can support 16TiB devices and leave pgoff_t
> > 32-bit, send a patch that does it. Until you make it, you should
> > apply the patch that I sent, which prevents kernel lockups or data
> > corruption when the user uses a 16TiB device on a 32-bit kernel.
>
> Exactly.  I had actually looked into support for 16TiB devices for a
> NAS use case a while ago, but when explaining the effort involved, the
> idea was dropped quickly.  The Linux block device code is too deeply
> tied to the pagecache to make it easily feasible.

The memory management routines use pgoff_t, so we could define pgoff_t
as a 64-bit type. But lib/radix-tree.c uses unsigned long as an index
into the radix tree - and pgoff_t is cast to unsigned long when calling
the radix_tree routines - so we'd need to change lib/radix-tree.c to use
pgoff_t. Then there may be other places where pgoff_t is cast to
unsigned long, and they are not trivial to find (one could enable extra
compiler warnings about value-truncating casts, but I suppose this would
trigger a lot of false positives).

This needs some deep review by the people who designed the memory
management code.

Mikulas
Re: [PATCH] block devices: validate block device capacity
On Thu, 30 Jan 2014, James Bottomley wrote:
> > So, if you want 64-bit page offsets, you need to increase pgoff_t
> > size, and that will increase the limit for both files and block
> > devices.
>
> No. The point is the page cache mapping of the device uses a
> manufactured inode saved in the backing device. It looks fixable in
> the buffer code before the page cache gets involved.

So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
send a patch that does it. Until you make it, you should apply the patch
that I sent, which prevents kernel lockups or data corruption when the
user uses a 16TiB device on a 32-bit kernel.

Mikulas
[PATCH] block devices: validate block device capacity
When running the LVM2 testsuite on a 32-bit kernel, there are unkillable
processes stuck in the kernel consuming 100% CPU:

blkid   R running   0  2005  1409 0x0004
ce009d00 0082 ffcf c11280ba 0060 560b5dfd 3111 00fe41cb
ce009d00 d51cfeb0 001e 0002 0002 c10748c1 0002 c106cca4
Call Trace:
 [<c11280ba>] ? radix_tree_next_chunk+0xda/0x2c0
 [<c10748c1>] ? release_pages+0x61/0x160
 [<c106cca4>] ? find_get_pages+0x84/0x100
 [<c1251fbe>] ? _cond_resched+0x1e/0x40
 [<c10758cb>] ? truncate_inode_pages_range+0x12b/0x440
 [<c1075cb7>] ? truncate_inode_pages+0x17/0x20
 [<c10cf2ba>] ? __blkdev_put+0x3a/0x140
 [<c10d02db>] ? blkdev_close+0x1b/0x40
 [<c10a60b2>] ? __fput+0x72/0x1c0
 [<c1039461>] ? task_work_run+0x61/0xa0
 [<c1253b6f>] ? work_notifysig+0x24/0x35

This is caused by the fact that the LVM2 testsuite creates a 64TB
device. The kernel uses unsigned long to index pages in files and block
devices; on a 64TB device the unsigned long overflows (it can address at
most 16TB with 4k pages), causing the infinite loop. On 32-bit
architectures, we must limit block device size to PAGE_SIZE*(2^32-1).

The bug with untested device size is pervasive across the whole kernel;
some drivers test that the device size fits in sector_t, but this test
is not sufficient on 32-bit architectures.

This patch introduces a new function validate_disk_capacity that tests
whether the disk capacity is acceptable for the current kernel, and
modifies the drivers brd, ide-gd, dm and sd to use it.
Signed-off-by: Mikulas Patocka <mpato...@redhat.com>

---
 block/genhd.c                 |   23 +++
 drivers/block/brd.c           |   15 +++
 drivers/ide/ide-gd.c          |    8
 drivers/md/dm-ioctl.c         |    3 +--
 drivers/md/dm-table.c         |   14 +-
 drivers/scsi/sd.c             |   20 +++-
 include/linux/device-mapper.h |    2 +-
 include/linux/genhd.h         |    2 ++
 8 files changed, 70 insertions(+), 17 deletions(-)

Index: linux-2.6-compile/block/genhd.c
===================================================================
--- linux-2.6-compile.orig/block/genhd.c	2014-01-30 17:23:15.000000000 +0100
+++ linux-2.6-compile/block/genhd.c	2014-01-30 19:28:42.000000000 +0100
@@ -1835,3 +1835,26 @@ static void disk_release_events(struct g
 	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
 	kfree(disk->ev);
 }
+
+int validate_disk_capacity(u64 n_sectors, const char **reason)
+{
+	u64 n_pages;
+	if (n_sectors << 9 >> 9 != n_sectors) {
+		if (reason)
+			*reason = "The number of bytes is greater than 2^64.";
+		return -EOVERFLOW;
+	}
+	n_pages = (n_sectors + (1 << (PAGE_SHIFT - 9)) - 1) >> (PAGE_SHIFT - 9);
+	if (n_pages > ULONG_MAX) {
+		if (reason)
+			*reason = "Use 64-bit kernel.";
+		return -EFBIG;
+	}
+	if (n_sectors != (sector_t)n_sectors) {
+		if (reason)
+			*reason = "Use a kernel compiled with support for large block devices.";
+		return -ENOSPC;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(validate_disk_capacity);
Index: linux-2.6-compile/drivers/block/brd.c
===================================================================
--- linux-2.6-compile.orig/drivers/block/brd.c	2014-01-30 17:23:15.000000000 +0100
+++ linux-2.6-compile/drivers/block/brd.c	2014-01-30 19:26:51.000000000 +0100
@@ -429,12 +429,12 @@ static const struct block_device_operati
  * And now the modules code and kernel interface.
  */

 static int rd_nr;
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
+static unsigned rd_size = CONFIG_BLK_DEV_RAM_SIZE;
 static int max_part;
 static int part_shift;
 module_param(rd_nr, int, S_IRUGO);
 MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
-module_param(rd_size, int, S_IRUGO);
+module_param(rd_size, uint, S_IRUGO);
 MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
 module_param(max_part, int, S_IRUGO);
 MODULE_PARM_DESC(max_part, "Maximum number of partitions per RAM disk");
@@ -446,7 +446,7 @@ MODULE_ALIAS("rd");

 /* Legacy boot options - nonmodular */
 static int __init ramdisk_size(char *str)
 {
-	rd_size = simple_strtol(str, NULL, 0);
+	rd_size = simple_strtoul(str, NULL, 0);
 	return 1;
 }
 __setup("ramdisk_size=", ramdisk_size);
@@ -463,6 +463,13 @@ static struct brd_device *brd_alloc(int
 {
 	struct brd_device *brd;
 	struct gendisk *disk;
+	u64 capacity = (u64)rd_size * 2;
+	const char *reason;
+
+	if (validate_disk_capacity(capacity, &reason)) {
+		printk(KERN_ERR "brd: disk is too big: %s\n", reason);
+		goto out;
+	}

 	brd = kzalloc(sizeof(*brd), GFP_KERNEL);
 	if (!brd)
@@ -493,7 +500,7 @@ static struct brd_device *brd_alloc(int
 	disk->queue = brd->brd_queue;
 	disk->flags |=
Re: [PATCH] block devices: validate block device capacity
On Thu, 2014-01-30 at 15:40 -0500, Mikulas Patocka wrote:
> When running the LVM2 testsuite on a 32-bit kernel, there are
> unkillable processes stuck in the kernel consuming 100% CPU:
>
> blkid   R running   0  2005  1409 0x0004
> ce009d00 0082 ffcf c11280ba 0060 560b5dfd 3111 00fe41cb
> ce009d00 d51cfeb0 001e 0002 0002 c10748c1 0002 c106cca4
> Call Trace:
>  [<c11280ba>] ? radix_tree_next_chunk+0xda/0x2c0
>  [<c10748c1>] ? release_pages+0x61/0x160
>  [<c106cca4>] ? find_get_pages+0x84/0x100
>  [<c1251fbe>] ? _cond_resched+0x1e/0x40
>  [<c10758cb>] ? truncate_inode_pages_range+0x12b/0x440
>  [<c1075cb7>] ? truncate_inode_pages+0x17/0x20
>  [<c10cf2ba>] ? __blkdev_put+0x3a/0x140
>  [<c10d02db>] ? blkdev_close+0x1b/0x40
>  [<c10a60b2>] ? __fput+0x72/0x1c0
>  [<c1039461>] ? task_work_run+0x61/0xa0
>  [<c1253b6f>] ? work_notifysig+0x24/0x35
>
> This is caused by the fact that the LVM2 testsuite creates a 64TB
> device. The kernel uses unsigned long to index pages in files and
> block devices; on a 64TB device the unsigned long overflows (it can
> address at most 16TB with 4k pages), causing the infinite loop.

Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
allow 64 bit offsets for block devices on 32 bit.  It sounds like
there's somewhere not using sector_t ... or using it wrongly, which
needs fixing.

> On 32-bit architectures, we must limit block device size to
> PAGE_SIZE*(2^32-1).

So you're saying CONFIG_LBDAF can never work, why?

James
Re: [PATCH] block devices: validate block device capacity
On Thu, 30 Jan 2014, James Bottomley wrote:
> Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
> allow 64 bit offsets for block devices on 32 bit.  It sounds like
> there's somewhere not using sector_t ... or using it wrongly, which
> needs fixing.

The page cache uses unsigned long as a page index. Therefore, if
unsigned long is 32-bit, the block device may have at most 2^32-1 pages.
On 32-bit architectures, we must limit block device size to
PAGE_SIZE*(2^32-1).

> So you're saying CONFIG_LBDAF can never work, why?

CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86,
without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is
16TiB (4096*2^32).

Mikulas
Re: [PATCH] block devices: validate block device capacity
On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> On Thu, 30 Jan 2014, James Bottomley wrote:
> > Why is this?  The whole reason for CONFIG_LBDAF is supposed to be to
> > allow 64 bit offsets for block devices on 32 bit.  It sounds like
> > there's somewhere not using sector_t ... or using it wrongly, which
> > needs fixing.
>
> The page cache uses unsigned long as a page index. Therefore, if
> unsigned long is 32-bit, the block device may have at most 2^32-1
> pages.

Um, that's the index into the mapping, not the device; a device can have
multiple mappings, and each mapping has a radix tree of pages. For most
filesystems a mapping is equivalent to a file, so we can have large
filesystems, but they can't actually have files over 4GB on 32 bits,
otherwise mmap fails. Are we running into a problem with struct
address_space where we've assumed the inode belongs to the file, and lvm
is doing something where it's the whole device?

> On 32-bit architectures, we must limit block device size to
> PAGE_SIZE*(2^32-1).
>
> > So you're saying CONFIG_LBDAF can never work, why?
>
> CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86,
> without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit
> is 16TiB (4096*2^32).

I don't think the people who did the large block device work expected to
gain only 3 bits for all their pain.

James
Re: [PATCH] block devices: validate block device capacity
On Thu, 30 Jan 2014, James Bottomley wrote:
> On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> > The page cache uses unsigned long as a page index. Therefore, if
> > unsigned long is 32-bit, the block device may have at most 2^32-1
> > pages.
>
> Um, that's the index into the mapping, not the device; a device can
> have multiple mappings, and each mapping has a radix tree of pages.
> For most filesystems a mapping is equivalent to a file, so we can have
> large filesystems, but they can't actually have files over 4GB on 32
> bits, otherwise mmap fails.

A device may be accessed directly (by opening /dev/sdX) and that creates
a mapping too - thus, the size of a mapping limits the size of a block
device.

The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes
may fix it - but there may be some hidden places where pgoff_t is
converted to unsigned long - who knows if they exist or not?

> Are we running into a problem with struct address_space where we've
> assumed the inode belongs to the file, and lvm is doing something
> where it's the whole device?

lvm creates a 64TiB device, udev runs blkid on that device, and blkid
opens the device and gets stuck because of the unsigned long overflow.

> I don't think the people who did the large block device work expected
> to gain only 3 bits for all their pain.

One could change it to have three choices:

  2TiB limit  - 32-bit sector_t and 32-bit pgoff_t
  16TiB limit - 64-bit sector_t and 32-bit pgoff_t
  32PiB limit - 64-bit sector_t and 64-bit pgoff_t

Though, we need to know whether the people who designed memory
management agree with changing pgoff_t to 64 bits.

Mikulas
Re: [PATCH] block devices: validate block device capacity
On Thu, 2014-01-30 at 19:20 -0500, Mikulas Patocka wrote:
> A device may be accessed directly (by opening /dev/sdX) and that
> creates a mapping too - thus, the size of a mapping limits the size of
> a block device.

Right, that's what I suspected below. We can't damage large block
support on filesystems just because of this corner case.

> The main problem is that pgoff_t has 4 bytes - changing it to 8 bytes
> may fix it - but there may be some hidden places where pgoff_t is
> converted to unsigned long - who knows if they exist or not?

I don't think we want to do that ... it will make struct page fatter and
have knock-on impacts in the radix tree code. To fix this, we need to
make the corner case (i.e. opening large block devices without a
filesystem) bear the pain. It sort of looks like we want to do a linear
array of mappings of 64TB for the device so the page cache calculations
don't overflow.

> > Are we running into a problem with struct address_space where we've
> > assumed the inode belongs to the file, and lvm is doing something
> > where it's the whole device?
>
> lvm creates a 64TiB device, udev runs blkid on that device, and blkid
> opens the device and gets stuck because of the unsigned long overflow.

Well, a simple open won't cause this ... it must be trying to read the
end of the device for some reason. But anyway, the way to fix this is to
fix the large block open as a corner case.

> One could change it to have three choices:
>
>   2TiB limit  - 32-bit sector_t and 32-bit pgoff_t
>   16TiB limit - 64-bit sector_t and 32-bit pgoff_t
>   32PiB limit - 64-bit sector_t and 64-bit pgoff_t
>
> Though, we need to know whether the people who designed memory
> management agree with changing pgoff_t to 64 bits.

I don't think we can change the size of pgoff_t ... because it won't
just be that, it will be other problems like the radix tree. However,
you also have to bear in mind that truncating large block device support
to 64TB on 32 bits is a technical ABI break. Hopefully it is only
technical, because I don't know of any current consumer block device
that is 64TB yet, but anyone who'd created a filesystem over 64TB would
find it no longer mounted on 32 bits.

James
Re: [PATCH] block devices: validate block device capacity
On Thu, 30 Jan 2014, James Bottomley wrote:
> > A device may be accessed directly (by opening /dev/sdX) and that
> > creates a mapping too - thus, the size of a mapping limits the size
> > of a block device.
>
> Right, that's what I suspected below. We can't damage large block
> support on filesystems just because of this corner case.

Devices larger than 16TiB never worked on a 32-bit kernel, so this patch
isn't damaging anything. Note that if you attach a >16TiB block device
and, without opening it directly, mount it, it still won't work, because
the buffer cache uses the page cache (see the function
__find_get_block_slow and the variable "pgoff_t index" - that variable
would overflow if the filesystem accessed a buffer beyond 16TiB).

> I don't think we want to do that ... it will make struct page fatter
> and have knock-on impacts in the radix tree code. To fix this, we need
> to make the corner case (i.e. opening large block devices without a
> filesystem) bear the pain. It sort of looks like we want to do a
> linear array of mappings of 64TB for the device so the page cache
> calculations don't overflow.

The code that reads and writes data to block devices and files is shared
- the functions in mm/filemap.c work for both files and block devices.
So, if you want 64-bit page offsets, you need to increase the pgoff_t
size, and that will increase the limit for both files and block devices.
You shouldn't have separate functions for managing pages on files and
separate functions for managing pages on block devices - that would
increase code size and cause maintenance problems.

> > Though, we need to know whether the people who designed memory
> > management agree with changing pgoff_t to 64 bits.
>
> I don't think we can change the size of pgoff_t ... because it won't
> just be that, it will be other problems like the radix tree.

If we can't change it, then we must stay with the current 16TiB limit.
There's no other way.

> However, you also have to bear in mind that truncating large block
> device support to 64TB on 32 bits is a technical ABI break. Hopefully
> it is only technical, because I don't know of any current consumer
> block device that is 64TB yet, but anyone who'd created a filesystem
> over 64TB would find it no longer mounted on 32 bits.

It is not an ABI break, because block devices larger than 16TiB never
worked on 32-bit architectures. So it's better to refuse them outright
than to cause subtle lockups or data corruption.

Mikulas