[RFC PATCH] block: introduce poison tracking for block devices
This patch copies the badblock management code from md-raid to use it for tracking bad/'poison' sectors on a per-block device level.

NVDIMM devices, which behave more like DRAM, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash.

The block device maintaining a runtime list of all known poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device.

Signed-off-by: Vishal Verma
---

This really is a copy-paste + a few modifications of the badblock management code + sysfs representation from md. In this RFC, I want to make sure this path sounds acceptable for the use case described above, for NVDIMMs.

Eventually, I think the md badblock management and this should be refactored to use the same code - I think this should be easy to do:

- move the badblocks struct and associated functions into a header file (along the lines of include/linux/list.h)
- embed the structure into whatever needs to use this list (in case of md, this would be 'rdev', in the nvdimm case, the gendisk)
- call the functions from badblocks.h as needed to manipulate the list.
- The sysfs show/store functions in badblocks.h would be generic variants, with wrappers being present in md and gendisk to fit into their respective sysfs layouts

If this looks generally reasonable, I'll post a v2 with this refactoring done.

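The refactoring plan listed above (one shared structure, embedded in each user, manipulated only through common helpers) can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: every name ending in `_sketch`, the `SKETCH_TABLE_BYTES` constant, and the use of malloc/free in place of kernel allocators are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed table size; the patch allocates one kernel page instead. */
#define SKETCH_TABLE_BYTES 4096

/* Hypothetical stand-in for the shared bad-block structure. */
struct badblocks_sketch {
	int count;	/* number of table entries in use */
	uint64_t *page;	/* table, one 64-bit word per bad range */
};

/* Generic helpers that would live in the shared header. */
static inline int badblocks_sketch_init(struct badblocks_sketch *bb)
{
	assert(bb != NULL);
	bb->count = 0;
	bb->page = malloc(SKETCH_TABLE_BYTES);
	return bb->page ? 0 : -1;
}

static inline void badblocks_sketch_free(struct badblocks_sketch *bb)
{
	assert(bb != NULL);
	free(bb->page);
	bb->page = NULL;
}

/* md-style user: the structure is embedded in the per-device object. */
struct rdev_sketch {
	int raid_disk;
	struct badblocks_sketch badblocks;
};

/* nvdimm-style user: the same structure embedded in the disk object. */
struct gendisk_sketch {
	char disk_name[32];
	struct badblocks_sketch bb;
};
```

Both users would call the same helpers on their embedded member, for example `badblocks_sketch_init(&rdev.badblocks)` or `badblocks_sketch_init(&disk.bb)`, which is the property the list.h comparison is pointing at.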
 block/genhd.c         | 502 ++
 include/linux/genhd.h |  26 +++
 2 files changed, 528 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..de99d28 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -23,6 +23,15 @@

 #include "blk.h"

+#define BB_LEN_MASK	(0x00000000000001FFULL)
+#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK	(0x8000000000000000ULL)
+#define BB_MAX_LEN	512
+#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
 static DEFINE_MUTEX(block_class_lock);
 struct kobject *block_depr;

@@ -670,6 +679,496 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);

+int disk_poison_list_init(struct gendisk *disk)
+{
+	disk->plist = kmalloc(sizeof(*disk->plist), GFP_KERNEL);
+	if (!disk->plist)
+		return -ENOMEM;
+	disk->plist->count = 0;
+	disk->plist->shift = 0;
+	disk->plist->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	seqlock_init(&disk->plist->lock);
+	if (disk->plist->page == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(disk_poison_list_init);
+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so md_is_badblock
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ *  0 if there are no known bad blocks in the range
+ *  1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int disk_check_poison(struct gendisk *disk, sector_t s, int sectors,
+		      sector_t *first_bad, int *bad_sectors)
+{
+	struct disk_poison *bb = disk->plist;
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sect
[PATCH 2/3] block: Add badblock management for gendisks
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: Vishal Verma --- block/genhd.c | 64 +++ include/linux/genhd.h | 6 + 2 files changed, 70 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index 0c706f3..4209c32 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "blk.h" @@ -505,6 +506,20 @@ static int exact_lock(dev_t devt, void *data) return 0; } +static void disk_alloc_badblocks(struct gendisk *disk) +{ + disk->bb = kzalloc(sizeof(disk->bb), GFP_KERNEL); + if (!disk->bb) { + pr_warn("%s: failed to allocate space for badblocks\n", + disk->disk_name); + return; + } + + if (badblocks_init(disk->bb, 1)) + pr_warn("%s: failed to initialize badblocks\n", + disk->disk_name); +} + static void register_disk(struct gendisk *disk) { struct device *ddev = disk_to_dev(disk); @@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk) disk->first_minor = MINOR(devt); disk_alloc_events(disk); + disk_alloc_badblocks(disk); /* Register BDI before referencing it from bdev */ bdi = &disk->queue->backing_dev_info; @@ -657,6 +673,9 @@ void del_gendisk(struct gendisk *disk) blk_unregister_queue(disk); blk_unregister_region(disk_devt(disk), disk->minors); + badblocks_free(disk->bb); + kfree(disk->bb); + part_stat_set_all(&disk->part0, 0); disk->part0.stamp = 0; @@ -670,6 +689,48 @@ void del_gendisk(struct gendisk *disk) } EXPORT_SYMBOL(del_gendisk); +/* + * The gendisk 
usage of badblocks does not track acknowledgements for + * badblocks. We always assume they are acknowledged. + */ +int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors, + sector_t *first_bad, int *bad_sectors) +{ + return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors); +} +EXPORT_SYMBOL(disk_check_badblocks); + +int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + return badblocks_set(disk->bb, s, sectors, 1); +} +EXPORT_SYMBOL(disk_set_badblocks); + +int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + return badblocks_clear(disk->bb, s, sectors); +} +EXPORT_SYMBOL(disk_clear_badblocks); + +/* sysfs access to bad-blocks list. */ +static ssize_t disk_badblocks_show(struct device *dev, + struct device_attribute *attr, + char *page) +{ + struct gendisk *disk = dev_to_disk(dev); + + return badblocks_show(disk->bb, page, 0); +} + +static ssize_t disk_badblocks_store(struct device *dev, + struct device_attribute *attr, + const char *page, size_t len) +{ + struct gendisk *disk = dev_to_disk(dev); + + return badblocks_store(disk->bb, page, len, 0); +} + /** * get_gendisk - get partitioning information for a given device * @devt: device to get partitioning information for @@ -988,6 +1049,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show, static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL); static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL); static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL); +static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show, + disk_badblocks_store); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store); @@ -1009,6 +1072,7 @@ static struct attribute *disk_attrs[] = { &dev_attr_capability.attr, &dev_attr_stat.attr, &dev_attr_inflight.attr, + &dev_attr_badblocks.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST 
&dev_attr_fail.attr, #endif diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 2adbfa6..5563bde 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -162,6 +162,7 @@ struct disk_part_tbl { }; struct disk_events; +struct badblocks; struct gendisk {
[PATCH 0/3] Badblock tracking for gendisks
Patch 1 copies badblock management code into a header of its own, making it generally available. It follows common libraries of code such as linked lists, where anyone may embed a core data structure in another place, and use the provided accessor functions to manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use by NVDIMM devices). Right now, it is turned on unconditionally - I'd appreciate comments on if that is the right path.

Patch 3 converts md over to use the new badblocks 'library'. I have done some pretty simple testing on this - created a raid 1 device, made sure the sysfs entries show up, and can be used to add and view badblocks. A closer look by the md folks would be nice here.

Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/genhd.c             |  64 ++
 drivers/md/md.c           | 495 ++--
 drivers/md/md.h           |  31 +--
 include/linux/badblocks.h | 512 ++
 include/linux/genhd.h     |   6 +
 5 files changed, 603 insertions(+), 505 deletions(-)
 create mode 100644 include/linux/badblocks.h

--
2.5.0
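For readers following the series, the table entry format that the patch 1 comments describe (a 9-bit length field storing 0-511 for lengths 1-512, a 54-bit start sector, and an 'acknowledged' flag in the most significant bit) can be modelled in userspace C as below. This is a hedged sketch and not the patch's code: the `SK_`/`sk_` names are hypothetical, and the masks are derived here from the documented field widths instead of being copied from the BB_* macros.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths taken from the comment in the patch; names are hypothetical. */
#define SK_LEN_BITS	9
#define SK_OFFSET_BITS	54
#define SK_MAX_LEN	(1 << SK_LEN_BITS)	/* 512 sectors per entry */

#define SK_LEN_MASK	((UINT64_C(1) << SK_LEN_BITS) - 1)
#define SK_OFFSET_MASK	(((UINT64_C(1) << SK_OFFSET_BITS) - 1) << SK_LEN_BITS)
#define SK_ACK_MASK	(UINT64_C(1) << 63)

/* Pack a bad range into one 64-bit table entry. */
static inline uint64_t sk_make(uint64_t start, int len, int ack)
{
	assert(len >= 1 && len <= SK_MAX_LEN);
	assert(start < (UINT64_C(1) << SK_OFFSET_BITS));
	return (start << SK_LEN_BITS) | (uint64_t)(len - 1) |
	       ((uint64_t)(!!ack) << 63);
}

/* Start sector of the range. */
static inline uint64_t sk_offset(uint64_t entry)
{
	return (entry & SK_OFFSET_MASK) >> SK_LEN_BITS;
}

/* Length in sectors; the stored value is length minus one. */
static inline int sk_len(uint64_t entry)
{
	return (int)(entry & SK_LEN_MASK) + 1;
}

/* 1 if the range has been acknowledged, 0 otherwise. */
static inline int sk_ack(uint64_t entry)
{
	return !!(entry & SK_ACK_MASK);
}
```

Storing length minus one is what lets nine bits cover a full 512-sector range; a longer bad range would need more than one entry.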
[PATCH 3/3] md: convert to use the generic badblocks code
Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma --- drivers/md/md.c | 495 +++- drivers/md/md.h | 31 +--- 2 files changed, 21 insertions(+), 505 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index c702de1..82994d7 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -34,6 +34,7 @@ #include #include +#include #include #include #include @@ -1358,8 +1359,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb) return cpu_to_le32(csum); } -static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors, - int acknowledged); static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version) { struct mdp_superblock_1 *sb; @@ -1484,7 +1483,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ count <<= sb->bblog_shift; if (bb + 1 == 0) break; - if (md_set_badblocks(&rdev->badblocks, + if (badblocks_set(&rdev->badblocks, sector, count, 1) == 0) return -EINVAL; } @@ -2226,7 +2225,7 @@ repeat: rdev_for_each(rdev, mddev) { if (rdev->badblocks.changed) { rdev->badblocks.changed = 0; - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); md_error(mddev, rdev); } clear_bit(Blocked, &rdev->flags); @@ -2352,7 +2351,7 @@ repeat: clear_bit(Blocked, &rdev->flags); if (any_badblocks_changed) - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); clear_bit(BlockedBadBlocks, &rdev->flags); wake_up(&rdev->blocked_wait); } @@ -2944,11 +2943,17 @@ static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_ static struct rdev_sysfs_entry rdev_recovery_start = __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store); -static ssize_t -badblocks_show(struct badblocks *bb, char *page, int unack); -static ssize_t -badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack); - +/* sysfs access to 
bad-blocks list. + * We present two files. + * 'bad-blocks' lists sector numbers and lengths of ranges that + *are recorded as bad. The list is truncated to fit within + *the one-page limit of sysfs. + *Writing "sector length" to this file adds an acknowledged + *bad block list. + * 'unacknowledged-bad-blocks' lists bad blocks that have not yet + *been acknowledged. Writing to this file adds bad blocks + *without acknowledging them. This is largely for testing. + */ static ssize_t bb_show(struct md_rdev *rdev, char *page) { return badblocks_show(&rdev->badblocks, page, 0); @@ -8348,253 +8353,7 @@ void md_finish_reshape(struct mddev *mddev) } EXPORT_SYMBOL(md_finish_reshape); -/* Bad block management. - * We can record which blocks on each device are 'bad' and so just - * fail those blocks, or that stripe, rather than the whole device. - * Entries in the bad-block table are 64bits wide. This comprises: - * Length of bad-range, in sectors: 0-511 for lengths 1-512 - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes) - * A 'shift' can be set so that larger blocks are tracked and - * consequently larger devices can be covered. - * 'Acknowledged' flag - 1 bit. - the most significant bit. - * - * Locking of the bad-block table uses a seqlock so md_is_badblock - * might need to retry if it is very unlucky. - * We will sometimes want to check for bad blocks in a bi_end_io function, - * so we use the write_seqlock_irq variant. - * - * When looking for a bad block we specify a range and want to - * know if any block in the range is bad. So we binary-search - * to the last range that starts at-or-before the given endpoint, - * (or "before the sector after the target range") - * then see if it ends after the given start. - * We return - * 0 if there are no known bad blocks in the range - * 1 if there are known bad block which are all acknowledged - * -1 if there are bad blocks which have not yet been acknowledged in metadata. 
- * plus the start/length of the first bad section we overlap. - */ -int md_is_badblo
[PATCH 1/3] badblocks: Add core badblock management code
Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions.

Signed-off-by: Vishal Verma
---
 include/linux/badblocks.h | 512 ++
 1 file changed, 512 insertions(+)
 create mode 100644 include/linux/badblocks.h

diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
new file mode 100644
index 0000000..94fa348
--- /dev/null
+++ b/include/linux/badblocks.h
@@ -0,0 +1,512 @@
+#ifndef _LINUX_BADBLOCKS_H
+#define _LINUX_BADBLOCKS_H
+
+#include
+#include
+#include
+
+#define BB_LEN_MASK	(0x00000000000001FFULL)
+#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK	(0x8000000000000000ULL)
+#define BB_MAX_LEN	512
+#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
+/* Bad block numbers are stored sorted in a single page.
+ * 64bits is used for each block or extent.
+ * 54 bits are sector number, 9 bits are extent size,
+ * 1 bit is an 'acknowledged' flag.
+ */
+#define MAX_BADBLOCKS	(PAGE_SIZE/8)
+
+struct badblocks {
+	int count;		/* count of bad blocks */
+	int unacked_exist;	/* there probably are unacknowledged
+				 * bad blocks. This is only cleared
+				 * when a read discovers none
+				 */
+	int shift;		/* shift from sectors to block size
+				 * a -ve shift means badblocks are
+				 * disabled.*/
+	u64 *page;		/* badblock list */
+	int changed;
+	seqlock_t lock;
+	sector_t sector;
+	sector_t size;		/* in sectors */
+};
+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ *  0 if there are no known bad blocks in the range
+ *  1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+static inline int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+				  sector_t *first_bad, int *bad_sectors)
+{
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+	/* 'target' is now the first block after the bad range */
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+	lo = 0;
+	rv = 0;
+	hi = bb->count;
+
+	/* Binary search between lo and hi for 'target'
+	 * i.e. for the last range that starts before 'target'
+	 */
+	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+	 * are known not to be the last range before target.
+	 * VARIANT: hi-lo is the number of possible
+	 * ranges, and decreases until it reaches 1
+	 */
+	while (hi - lo > 1) {
+		int mid = (lo + hi) / 2;
+		sector_t a = BB_OFFSET(p[mid]);
+		if (a < target)
+			/* This could still
[PATCH v2 2/3] block: Add badblock management for gendisks
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: Vishal Verma --- block/genhd.c | 81 +++ include/linux/genhd.h | 6 2 files changed, 87 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index 0c706f3..84fd65c 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "blk.h" @@ -505,6 +506,20 @@ static int exact_lock(dev_t devt, void *data) return 0; } +static void disk_alloc_badblocks(struct gendisk *disk) +{ + disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL); + if (!disk->bb) { + pr_warn("%s: failed to allocate space for badblocks\n", + disk->disk_name); + return; + } + + if (badblocks_init(disk->bb, 1)) + pr_warn("%s: failed to initialize badblocks\n", + disk->disk_name); +} + static void register_disk(struct gendisk *disk) { struct device *ddev = disk_to_dev(disk); @@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk) disk->first_minor = MINOR(devt); disk_alloc_events(disk); + disk_alloc_badblocks(disk); /* Register BDI before referencing it from bdev */ bdi = &disk->queue->backing_dev_info; @@ -657,6 +673,11 @@ void del_gendisk(struct gendisk *disk) blk_unregister_queue(disk); blk_unregister_region(disk_devt(disk), disk->minors); + if (disk->bb) { + badblocks_free(disk->bb); + kfree(disk->bb); + } + part_stat_set_all(&disk->part0, 0); disk->part0.stamp = 0; @@ -670,6 +691,63 @@ void del_gendisk(struct gendisk *disk) } 
EXPORT_SYMBOL(del_gendisk); +/* + * The gendisk usage of badblocks does not track acknowledgements for + * badblocks. We always assume they are acknowledged. + */ +int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors, + sector_t *first_bad, int *bad_sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors); +} +EXPORT_SYMBOL(disk_check_badblocks); + +int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_set(disk->bb, s, sectors, 1); +} +EXPORT_SYMBOL(disk_set_badblocks); + +int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_clear(disk->bb, s, sectors); +} +EXPORT_SYMBOL(disk_clear_badblocks); + +/* sysfs access to bad-blocks list. */ +static ssize_t disk_badblocks_show(struct device *dev, + struct device_attribute *attr, + char *page) +{ + struct gendisk *disk = dev_to_disk(dev); + + if (!disk->bb) + return 0; + + return badblocks_show(disk->bb, page, 0); +} + +static ssize_t disk_badblocks_store(struct device *dev, + struct device_attribute *attr, + const char *page, size_t len) +{ + struct gendisk *disk = dev_to_disk(dev); + + if (!disk->bb) + return 0; + + return badblocks_store(disk->bb, page, len, 0); +} + /** * get_gendisk - get partitioning information for a given device * @devt: device to get partitioning information for @@ -988,6 +1066,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show, static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL); static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL); static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL); +static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show, + disk_badblocks_store); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, 
part_fail_store); @@ -1009,6 +1089,7 @@ static struct attribute *disk_attrs[] = { &dev_attr_capability.attr, &dev_attr_stat.attr, &dev_attr_inflight.attr, + &dev_attr_badblocks.attr, #i
[PATCH v2 0/3] Badblock tracking for gendisks
v2:
- In badblocks_free, make 'page' NULL (patch 1)
- Move the core badblocks code to a new .c file (patch 1) (Jens)
- Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
- Since disk_alloc_badblocks can fail, check disk->bb for NULL in the genhd wrappers (patch 2) (Jeff)
- Update the md conversion to also use the badblocks init and free functions (patch 3)
- Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies badblock management code into a header of its own, making it generally available. It follows common libraries of code such as linked lists, where anyone may embed a core data structure in another place, and use the provided accessor functions to manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use by NVDIMM devices). Right now, it is turned on unconditionally - I'd appreciate comments on if that is the right path.

Patch 3 converts md over to use the new badblocks 'library'. I have done some pretty simple testing on this - created a raid 1 device, made sure the sysfs entries show up, and can be used to add and view badblocks. A closer look by the md folks would be nice here.

Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile            |   2 +-
 block/badblocks.c         | 523 ++
 block/genhd.c             |  81 +++
 drivers/md/md.c           | 507 ++--
 drivers/md/md.h           |  40 +---
 include/linux/badblocks.h |  53 +
 include/linux/genhd.h     |   6 +
 7 files changed, 687 insertions(+), 525 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

--
2.5.0
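Three of the v2 changelog items (sizing the allocation by the structure instead of the pointer, tolerating a failed allocation in the wrappers, and clearing 'page' on free) can be illustrated with a small userspace model. Everything here is a hypothetical sketch: the `_sketch` names are invented, calloc/malloc/free stand in for the kernel allocators, and unlike the posted patch this version also releases the structure when table initialization fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct badblocks and struct gendisk. */
struct bb_sketch {
	int count;
	uint64_t *page;
};

struct disk_sketch {
	struct bb_sketch *bb;	/* stays NULL if allocation failed */
};

static int bb_sketch_init(struct bb_sketch *bb)
{
	bb->count = 0;
	bb->page = malloc(4096);	/* assumed table size */
	return bb->page ? 0 : -1;
}

static void bb_sketch_free(struct bb_sketch *bb)
{
	free(bb->page);
	bb->page = NULL;	/* a repeated free becomes harmless */
}

static int disk_sketch_alloc_badblocks(struct disk_sketch *disk)
{
	assert(disk != NULL);
	/* sizeof(*disk->bb) is the structure size; sizeof(disk->bb)
	 * would only be the size of a pointer.
	 */
	disk->bb = calloc(1, sizeof(*disk->bb));
	if (!disk->bb)
		return -1;
	if (bb_sketch_init(disk->bb)) {
		free(disk->bb);
		disk->bb = NULL;
		return -1;
	}
	return 0;
}

/* Wrapper in the style of the genhd helpers: no list means no bad blocks. */
static int disk_sketch_badblock_count(const struct disk_sketch *disk)
{
	if (!disk->bb)
		return 0;
	return disk->bb->count;
}

static void disk_sketch_release(struct disk_sketch *disk)
{
	if (disk->bb) {
		bb_sketch_free(disk->bb);
		free(disk->bb);
		disk->bb = NULL;
	}
}
```

With the NULL checks in place, a disk whose list was never allocated simply reports no bad blocks, which mirrors the `return 0` path in the v2 genhd wrappers.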
[PATCH v2 3/3] md: convert to use the generic badblocks code
Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma --- drivers/md/md.c | 507 +++- drivers/md/md.h | 40 + 2 files changed, 23 insertions(+), 524 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index c702de1..63eab20 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -34,6 +34,7 @@ #include #include +#include #include #include #include @@ -707,8 +708,7 @@ void md_rdev_clear(struct md_rdev *rdev) put_page(rdev->bb_page); rdev->bb_page = NULL; } - kfree(rdev->badblocks.page); - rdev->badblocks.page = NULL; + badblocks_free(&rdev->badblocks); } EXPORT_SYMBOL_GPL(md_rdev_clear); @@ -1358,8 +1358,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb) return cpu_to_le32(csum); } -static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors, - int acknowledged); static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version) { struct mdp_superblock_1 *sb; @@ -1484,7 +1482,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ count <<= sb->bblog_shift; if (bb + 1 == 0) break; - if (md_set_badblocks(&rdev->badblocks, + if (badblocks_set(&rdev->badblocks, sector, count, 1) == 0) return -EINVAL; } @@ -2226,7 +2224,7 @@ repeat: rdev_for_each(rdev, mddev) { if (rdev->badblocks.changed) { rdev->badblocks.changed = 0; - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); md_error(mddev, rdev); } clear_bit(Blocked, &rdev->flags); @@ -2352,7 +2350,7 @@ repeat: clear_bit(Blocked, &rdev->flags); if (any_badblocks_changed) - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); clear_bit(BlockedBadBlocks, &rdev->flags); wake_up(&rdev->blocked_wait); } @@ -2944,11 +2942,17 @@ static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_ static struct rdev_sysfs_entry rdev_recovery_start = __ATTR(recovery_start, S_IRUGO|S_IWUSR, 
recovery_start_show, recovery_start_store); -static ssize_t -badblocks_show(struct badblocks *bb, char *page, int unack); -static ssize_t -badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack); - +/* sysfs access to bad-blocks list. + * We present two files. + * 'bad-blocks' lists sector numbers and lengths of ranges that + *are recorded as bad. The list is truncated to fit within + *the one-page limit of sysfs. + *Writing "sector length" to this file adds an acknowledged + *bad block list. + * 'unacknowledged-bad-blocks' lists bad blocks that have not yet + *been acknowledged. Writing to this file adds bad blocks + *without acknowledging them. This is largely for testing. + */ static ssize_t bb_show(struct md_rdev *rdev, char *page) { return badblocks_show(&rdev->badblocks, page, 0); @@ -3063,14 +3067,7 @@ int md_rdev_init(struct md_rdev *rdev) * This reserves the space even on arrays where it cannot * be used - I wonder if that matters */ - rdev->badblocks.count = 0; - rdev->badblocks.shift = -1; /* disabled until explicitly enabled */ - rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL); - seqlock_init(&rdev->badblocks.lock); - if (rdev->badblocks.page == NULL) - return -ENOMEM; - - return 0; + return badblocks_init(&rdev->badblocks, 0); } EXPORT_SYMBOL_GPL(md_rdev_init); /* @@ -8348,253 +8345,7 @@ void md_finish_reshape(struct mddev *mddev) } EXPORT_SYMBOL(md_finish_reshape); -/* Bad block management. - * We can record which blocks on each device are 'bad' and so just - * fail those blocks, or that stripe, rather than the whole device. - * Entries in the bad-block table are 64bits wide. This comprises: - * Length of bad-range, in sectors: 0-511 for lengths 1-512 - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes) - * A 'shift' can be set so that larger blocks are tracked and - * consequently larger
[PATCH v2 1/3] badblocks: Add core badblock management code
Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data. The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions. Signed-off-by: Vishal Verma --- block/Makefile| 2 +- block/badblocks.c | 523 ++ include/linux/badblocks.h | 53 + 3 files changed, 577 insertions(+), 1 deletion(-) create mode 100644 block/badblocks.c create mode 100644 include/linux/badblocks.h diff --git a/block/Makefile b/block/Makefile index 00ecc97..db5f622 100644 --- a/block/Makefile +++ b/block/Makefile @@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \ blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ - partitions/ + badblocks.o partitions/ obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_BLK_DEV_BSG) += bsg.o diff --git a/block/badblocks.c b/block/badblocks.c new file mode 100644 index 000..6e07855 --- /dev/null +++ b/block/badblocks.c @@ -0,0 +1,523 @@ +/* + * Bad block management + * + * - Heavily based on MD badblocks code from Neil Brown + * + * Copyright (c) 2015, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/*
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ * 0 if there are no known bad blocks in the range
+ * 1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+			sector_t *first_bad, int *bad_sectors)
+{
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+	/* 'target' is now the first block after the bad range */
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+	lo = 0;
+	rv = 0;
+	hi = bb->count;
+
+	/* Binary search between lo and hi for 'target'
+	 * i.e. for the last range that starts before 'target'
+	 */
+	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+	 * are known not to be the last range before target.
+	 * VARIANT: hi-lo is the number of possible
+	 * ranges, and decreases until it reaches 1
+	 */
+	while (hi - lo > 1) {
+		int mid = (lo + hi) / 2;
+		sector_t a = BB_OFFSET(p[mid]);
+
+		if (a < target)
+			/* This could still be the one, earlier ranges
+			 * could not.
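The entry layout described in the comment above (9 bits of length, 54 bits of start sector, one acknowledged bit) can be tried out in userspace. This is a minimal sketch, not the kernel code: the macro names follow the BB_* macros used by the series, but the mask values are derived here from the stated bit widths and the use of uint64_t is an assumption.

```c
#include <stdint.h>

/*
 * Userspace model of one 64-bit bad-block table entry.
 * Assumed layout, derived from the comment in the patch:
 *   bits  0-8   length - 1 (so 0-511 encodes lengths 1-512)
 *   bits  9-62  start sector (54 bits)
 *   bit   63    'acknowledged' flag
 */
#define BB_LEN_MASK	(0x00000000000001FFULL)
#define BB_OFFSET_MASK	(0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK	(0x8000000000000000ULL)
#define BB_MAX_LEN	512

/* Extract the start sector, the length and the acknowledged flag. */
#define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)	(!!((x) & BB_ACK_MASK))

/* Pack start sector 'a', length 'l' (1..BB_MAX_LEN) and the ack flag. */
#define BB_MAKE(a, l, ack) \
	(((uint64_t)(a) << 9) | ((uint64_t)(l) - 1) | ((uint64_t)(!!(ack)) << 63))
```

Storing length - 1 is what lets nine bits cover a full 512-sector range; a range longer than BB_MAX_LEN has to be split across several entries.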
[PATCH v2 2/3] block: Add badblock management for gendisks
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: Vishal Verma --- block/genhd.c | 76 +++ include/linux/genhd.h | 7 + 2 files changed, 83 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index 0c706f3..809e3e2 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "blk.h" @@ -505,6 +506,16 @@ static int exact_lock(dev_t devt, void *data) return 0; } +int disk_alloc_badblocks(struct gendisk *disk) +{ + disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL); + if (!disk->bb) + return -ENOMEM; + + return badblocks_init(disk->bb, 1); +} +EXPORT_SYMBOL(disk_alloc_badblocks); + static void register_disk(struct gendisk *disk) { struct device *ddev = disk_to_dev(disk); @@ -657,6 +668,11 @@ void del_gendisk(struct gendisk *disk) blk_unregister_queue(disk); blk_unregister_region(disk_devt(disk), disk->minors); + if (disk->bb) { + badblocks_free(disk->bb); + kfree(disk->bb); + } + part_stat_set_all(&disk->part0, 0); disk->part0.stamp = 0; @@ -670,6 +686,63 @@ void del_gendisk(struct gendisk *disk) } EXPORT_SYMBOL(del_gendisk); +/* + * The gendisk usage of badblocks does not track acknowledgements for + * badblocks. We always assume they are acknowledged. 
+ */ +int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors, + sector_t *first_bad, int *bad_sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors); +} +EXPORT_SYMBOL(disk_check_badblocks); + +int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_set(disk->bb, s, sectors, 1); +} +EXPORT_SYMBOL(disk_set_badblocks); + +int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_clear(disk->bb, s, sectors); +} +EXPORT_SYMBOL(disk_clear_badblocks); + +/* sysfs access to bad-blocks list. */ +static ssize_t disk_badblocks_show(struct device *dev, + struct device_attribute *attr, + char *page) +{ + struct gendisk *disk = dev_to_disk(dev); + + if (!disk->bb) + return 0; + + return badblocks_show(disk->bb, page, 0); +} + +static ssize_t disk_badblocks_store(struct device *dev, + struct device_attribute *attr, + const char *page, size_t len) +{ + struct gendisk *disk = dev_to_disk(dev); + + if (!disk->bb) + return 0; + + return badblocks_store(disk->bb, page, len, 0); +} + /** * get_gendisk - get partitioning information for a given device * @devt: device to get partitioning information for @@ -988,6 +1061,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show, static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL); static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL); static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL); +static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show, + disk_badblocks_store); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store); @@ -1009,6 +1084,7 @@ static struct attribute *disk_attrs[] = { &dev_attr_capability.attr, &dev_attr_stat.attr, &dev_attr_inflight.attr, + 
&dev_attr_badblocks.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, #endif diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 2adbfa6..985eb94 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -162,6 +162,7 @@ struct disk_part_tbl { }; struct disk_events; +struct badblocks; struct gendisk { /* major, first_minor and minors are input parameters only, @@ -201,6 +202,7 @@ struct gendisk { struct blk_integrity *integrity; #endif int node_id; + s
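The genhd wrappers in this patch all share one pattern: the list is allocated only when a driver asks for it, and every accessor treats a missing list as "no known bad blocks". The following userspace model illustrates that pattern only. Every model_* name is hypothetical, and a single tracked range stands in for the real table.

```c
#include <stdlib.h>

/* Userspace model only: the model_* names are stand-ins, not kernel symbols. */
struct model_badblocks {
	unsigned long long start;	/* first bad sector */
	int len;			/* number of bad sectors, 0 if none recorded */
};

struct model_disk {
	struct model_badblocks *bb;	/* NULL until the driver opts in */
};

/* Opt in to tracking; returns 0 on success, -1 if allocation fails. */
static int model_disk_alloc_badblocks(struct model_disk *disk)
{
	disk->bb = calloc(1, sizeof(*disk->bb));
	return disk->bb ? 0 : -1;
}

/*
 * Returns 1 if the recorded range overlaps [s, s + sectors), else 0.
 * A disk without a list reports 0, as the wrappers in the patch do.
 */
static int model_disk_check_badblocks(const struct model_disk *disk,
				      unsigned long long s, int sectors)
{
	if (!disk->bb || disk->bb->len == 0)
		return 0;
	return disk->bb->start < s + sectors &&
	       disk->bb->start + disk->bb->len > s;
}

/* Build a disk, optionally opt in and record one range, then query it. */
int model_demo(int opt_in, unsigned long long bad_start, int bad_len,
	       unsigned long long s, int sectors)
{
	struct model_disk disk = { NULL };
	int rv;

	if (opt_in && model_disk_alloc_badblocks(&disk) == 0) {
		disk.bb->start = bad_start;
		disk.bb->len = bad_len;
	}
	rv = model_disk_check_badblocks(&disk, s, sectors);
	free(disk.bb);	/* free(NULL) is a no-op */
	return rv;
}
```

With this shape, a disk that never opted in pays only for a NULL pointer in its structure, and callers do not need to know whether tracking is enabled.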
[PATCH v2 1/3] badblocks: Add core badblock management code
Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data. The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions. Signed-off-by: Vishal Verma --- block/Makefile| 2 +- block/badblocks.c | 576 ++ include/linux/badblocks.h | 53 + 3 files changed, 630 insertions(+), 1 deletion(-) create mode 100644 block/badblocks.c create mode 100644 include/linux/badblocks.h diff --git a/block/Makefile b/block/Makefile index 00ecc97..db5f622 100644 --- a/block/Makefile +++ b/block/Makefile @@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \ blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ - partitions/ + badblocks.o partitions/ obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_BLK_DEV_BSG) += bsg.o diff --git a/block/badblocks.c b/block/badblocks.c new file mode 100644 index 000..f0ac279 --- /dev/null +++ b/block/badblocks.c @@ -0,0 +1,576 @@ +/* + * Bad block management + * + * - Heavily based on MD badblocks code from Neil Brown + * + * Copyright (c) 2015, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +/** + * badblocks_check() - check a given range for bad sectors + * @bb:the badblocks structure that holds all badblock information + * @s: sector (start) at which to check for badblocks + * @sectors: number of sectors to check for badblocks + * @first_bad: pointer to store location of the first badblock + * @bad_sectors: pointer to store number of badblocks after @first_bad + * + * We can record which blocks on each device are 'bad' and so just + * fail those blocks, or that stripe, rather than the whole device. + * Entries in the bad-block table are 64bits wide. This comprises: + * Length of bad-range, in sectors: 0-511 for lengths 1-512 + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes) + * A 'shift' can be set so that larger blocks are tracked and + * consequently larger devices can be covered. + * 'Acknowledged' flag - 1 bit. - the most significant bit. + * + * Locking of the bad-block table uses a seqlock so badblocks_check + * might need to retry if it is very unlucky. + * We will sometimes want to check for bad blocks in a bi_end_io function, + * so we use the write_seqlock_irq variant. + * + * When looking for a bad block we specify a range and want to + * know if any block in the range is bad. So we binary-search + * to the last range that starts at-or-before the given endpoint, + * (or "before the sector after the target range") + * then see if it ends after the given start. + * + * Return: + * 0: there are no known bad blocks in the range + * 1: there are known bad block which are all acknowledged + * -1: there are bad blocks which have not yet been acknowledged in metadata. + * plus the start/length of the first bad section we overlap. 
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+			sector_t *first_bad, int *bad_sectors)
+{
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+	/* 'target' is now the first block after the bad range */
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+	lo = 0;
+	rv = 0;
+	hi = bb->count;
+
+	/* Binary search between lo and hi for 'target'
+	 * i.e. for the last range that starts before 'target'
+	 */
+	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+	 * are known not to be the last range before target.
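The lookup described in the kernel-doc comment (binary-search to the last range that starts before the end of the query, then walk back while ranges still end after the query start) can be modelled in userspace as below. This is a sketch under assumptions: the table is a sorted array of plain structs rather than packed 64-bit entries, the seqlock retry and the 'shift' handling are left out, and struct bb_range, bb_check_model and the demo_* helpers are hypothetical names.

```c
#include <stdint.h>

struct bb_range {
	uint64_t start;	/* first bad sector */
	int len;	/* number of bad sectors */
	int ack;	/* 1 if acknowledged */
};

/*
 * Returns 0 if nothing in [s, s + sectors) is bad, 1 if all overlapping
 * ranges are acknowledged, -1 if any overlapping range is not.
 * 'p' must be sorted by start, with non-overlapping ranges.
 */
static int bb_check_model(const struct bb_range *p, int count,
			  uint64_t s, int sectors,
			  uint64_t *first_bad, int *bad_sectors)
{
	uint64_t target = s + sectors;	/* first sector after the query */
	int lo = 0, hi = count, rv = 0;

	/* Narrow [lo, hi) down to the last range starting before 'target'. */
	while (hi - lo > 1) {
		int mid = (lo + hi) / 2;

		if (p[mid].start < target)
			lo = mid;	/* could still be the one */
		else
			hi = mid;	/* starts at or after the end */
	}
	if (hi > lo) {	/* false only for an empty table */
		/* Walk back over every range that ends after 's'. */
		while (lo >= 0 && p[lo].start + p[lo].len > s) {
			if (p[lo].start < target) {
				/* starts before the end, ends after the start */
				if (rv != -1 && p[lo].ack)
					rv = 1;
				else
					rv = -1;
				*first_bad = p[lo].start;
				*bad_sectors = p[lo].len;
			}
			lo--;
		}
	}
	return rv;
}

/* Fixed example table: two acknowledged ranges and one unacknowledged. */
static const struct bb_range demo_table[] = {
	{ 100, 8, 1 }, { 200, 16, 0 }, { 500, 4, 1 },
};

int demo_check(uint64_t s, int sectors)
{
	uint64_t first_bad = 0;
	int bad_sectors = 0;

	return bb_check_model(demo_table, 3, s, sectors, &first_bad, &bad_sectors);
}

uint64_t demo_first_bad(uint64_t s, int sectors)
{
	uint64_t first_bad = 0;
	int bad_sectors = 0;

	bb_check_model(demo_table, 3, s, sectors, &first_bad, &bad_sectors);
	return first_bad;
}
```

Because the walk goes from the highest overlapping range downwards, the values left in first_bad and bad_sectors describe the lowest overlapping range, which matches "the first bad section we overlap" in the comment.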
[PATCH v2 3/3] md: convert to use the generic badblocks code
Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma --- drivers/md/md.c | 516 +++- drivers/md/md.h | 40 + 2 files changed, 28 insertions(+), 528 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index c702de1..afdc3ea 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -34,6 +34,7 @@ #include #include +#include #include #include #include @@ -707,8 +708,7 @@ void md_rdev_clear(struct md_rdev *rdev) put_page(rdev->bb_page); rdev->bb_page = NULL; } - kfree(rdev->badblocks.page); - rdev->badblocks.page = NULL; + badblocks_free(&rdev->badblocks); } EXPORT_SYMBOL_GPL(md_rdev_clear); @@ -1358,8 +1358,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb) return cpu_to_le32(csum); } -static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors, - int acknowledged); static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version) { struct mdp_superblock_1 *sb; @@ -1484,8 +1482,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ count <<= sb->bblog_shift; if (bb + 1 == 0) break; - if (md_set_badblocks(&rdev->badblocks, -sector, count, 1) == 0) + if (badblocks_set(&rdev->badblocks, sector, count, 1)) return -EINVAL; } } else if (sb->bblog_offset != 0) @@ -2226,7 +2223,7 @@ repeat: rdev_for_each(rdev, mddev) { if (rdev->badblocks.changed) { rdev->badblocks.changed = 0; - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); md_error(mddev, rdev); } clear_bit(Blocked, &rdev->flags); @@ -2352,7 +2349,7 @@ repeat: clear_bit(Blocked, &rdev->flags); if (any_badblocks_changed) - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); clear_bit(BlockedBadBlocks, &rdev->flags); wake_up(&rdev->blocked_wait); } @@ -2944,11 +2941,17 @@ static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_ static struct rdev_sysfs_entry 
rdev_recovery_start = __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store); -static ssize_t -badblocks_show(struct badblocks *bb, char *page, int unack); -static ssize_t -badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack); - +/* sysfs access to bad-blocks list. + * We present two files. + * 'bad-blocks' lists sector numbers and lengths of ranges that + *are recorded as bad. The list is truncated to fit within + *the one-page limit of sysfs. + *Writing "sector length" to this file adds an acknowledged + *bad block list. + * 'unacknowledged-bad-blocks' lists bad blocks that have not yet + *been acknowledged. Writing to this file adds bad blocks + *without acknowledging them. This is largely for testing. + */ static ssize_t bb_show(struct md_rdev *rdev, char *page) { return badblocks_show(&rdev->badblocks, page, 0); @@ -3063,14 +3066,7 @@ int md_rdev_init(struct md_rdev *rdev) * This reserves the space even on arrays where it cannot * be used - I wonder if that matters */ - rdev->badblocks.count = 0; - rdev->badblocks.shift = -1; /* disabled until explicitly enabled */ - rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL); - seqlock_init(&rdev->badblocks.lock); - if (rdev->badblocks.page == NULL) - return -ENOMEM; - - return 0; + return badblocks_init(&rdev->badblocks, 0); } EXPORT_SYMBOL_GPL(md_rdev_init); /* @@ -8348,254 +8344,9 @@ void md_finish_reshape(struct mddev *mddev) } EXPORT_SYMBOL(md_finish_reshape); -/* Bad block management. - * We can record which blocks on each device are 'bad' and so just - * fail those blocks, or that stripe, rather than the whole device. - * Entries in the bad-block table are 64bits wide. This comprises: - * Length of bad-range, in sectors: 0-511 for lengths 1-512 - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes) - * A 'shift' can be
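The sysfs comment in the hunk above says that writing "sector length" to the file adds a bad block range. A userspace sketch of parsing one such line follows. The function name, the sscanf format and the rejection rules are assumptions made for illustration, not a copy of the store handler in the patch.

```c
#include <stdio.h>

/*
 * Parse one "sector length" line as it might be written to the sysfs file.
 * Accepts an optional trailing newline. Returns 0 on success, -1 on
 * malformed input or a non-positive length.
 */
int parse_badblock_line(const char *page, unsigned long long *sector,
			int *length)
{
	char newline;

	switch (sscanf(page, "%llu %d%c", sector, length, &newline)) {
	case 3:
		if (newline != '\n')
			return -1;	/* trailing junk after the length */
		/* fall through */
	case 2:
		if (*length <= 0)
			return -1;
		break;
	default:
		return -1;		/* fewer than two numbers parsed */
	}
	return 0;
}

/* Helpers for quick experiments: return -1 when the line is rejected. */
long long demo_sector(const char *line)
{
	unsigned long long sector;
	int length;

	if (parse_badblock_line(line, &sector, &length))
		return -1;
	return (long long)sector;
}

int demo_length(const char *line)
{
	unsigned long long sector;
	int length;

	if (parse_badblock_line(line, &sector, &length))
		return -1;
	return length;
}
```

A shell command such as echo "1024 8" appends the newline that the three-field case checks for, while a write without a newline lands in the two-field case.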
[PATCH v2 0/3] Badblock tracking for gendisks
v3:
- Add kernel-doc style comments to all exported functions in badblocks.c (James)
- Make return values from badblocks functions consistent with themselves and the kernel style. Change the polarity of badblocks_set, and update all callers accordingly (James)
- In gendisk, don't unconditionally allocate badblocks, export the initializer. This also allows the initializer to be a non-void return type, so that the badblocks user can act upon failures better (James)

v2:
- In badblocks_free, make 'page' NULL (patch 1)
- Move the core badblocks code to a new .c file (patch 1) (Jens)
- Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
- Since disk_alloc_badblocks can fail, check disk->bb for NULL in the genhd wrappers (patch 2) (Jeff)
- Update the md conversion to also use the badblocks init and free functions (patch 3)
- Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies badblock management code into files of its own (block/badblocks.c and include/linux/badblocks.h), making it generally available. It follows common libraries of code such as linked lists, where anyone may embed a core data structure in another place, and use the provided accessor functions to manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use by NVDIMM devices).

Patch 3 converts md over to use the new badblocks 'library'. I have done some pretty simple testing on this - created a raid 1 device, made sure the sysfs entries show up, and can be used to add and view badblocks. A closer look by the md folks would be nice here.
Vishal Verma (3): badblocks: Add core badblock management code block: Add badblock management for gendisks md: convert to use the generic badblocks code block/Makefile| 2 +- block/badblocks.c | 576 ++ block/genhd.c | 76 ++ drivers/md/md.c | 516 ++--- drivers/md/md.h | 40 +--- include/linux/badblocks.h | 53 + include/linux/genhd.h | 7 + 7 files changed, 741 insertions(+), 529 deletions(-) create mode 100644 block/badblocks.c create mode 100644 include/linux/badblocks.h -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
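The v3 changelog item about changing the polarity of badblocks_set shows up in the md conversion, where a caller that used to test the return value against 0 for failure now treats a nonzero return as failure. The toy model below shows the two conventions side by side. All toy_* and load_entries_* names are hypothetical, and the fixed capacity only exists to give the functions a way to fail.

```c
#include <errno.h>

#define TOY_MAX_ENTRIES 4

struct toy_table {
	int count;	/* number of entries currently stored */
};

/* Older convention: returns 1 on success, 0 when the table is full. */
static int toy_set_old(struct toy_table *t)
{
	if (t->count >= TOY_MAX_ENTRIES)
		return 0;
	t->count++;
	return 1;
}

/* Newer convention: returns 0 on success, nonzero on failure. */
static int toy_set_new(struct toy_table *t)
{
	return toy_set_old(t) ? 0 : 1;
}

/* Caller written against the older convention. */
int load_entries_old(int n)
{
	struct toy_table t = { 0 };

	for (int i = 0; i < n; i++)
		if (toy_set_old(&t) == 0)
			return -EINVAL;
	return 0;
}

/* The same caller after the polarity change. */
int load_entries_new(int n)
{
	struct toy_table t = { 0 };

	for (int i = 0; i < n; i++)
		if (toy_set_new(&t))
			return -EINVAL;
	return 0;
}
```

Both callers behave identically from the outside; what changes is that the newer form reads like most other kernel functions, where zero means success.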
[PATCH v4 0/3] Badblock tracking for gendisks
v4:
- Rebase to v4.4-rc4

v3:
- Add kernel-doc style comments to all exported functions in badblocks.c (James)
- Make return values from badblocks functions consistent with themselves and the kernel style. Change the polarity of badblocks_set, and update all callers accordingly (James)
- In gendisk, don't unconditionally allocate badblocks, export the initializer. This also allows the initializer to be a non-void return type, so that the badblocks user can act upon failures better (James)

v2:
- In badblocks_free, make 'page' NULL (patch 1)
- Move the core badblocks code to a new .c file (patch 1) (Jens)
- Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
- Since disk_alloc_badblocks can fail, check disk->bb for NULL in the genhd wrappers (patch 2) (Jeff)
- Update the md conversion to also use the badblocks init and free functions (patch 3)
- Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies badblock management code into files of its own (block/badblocks.c and include/linux/badblocks.h), making it generally available. It follows common libraries of code such as linked lists, where anyone may embed a core data structure in another place, and use the provided accessor functions to manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use by NVDIMM devices).

Patch 3 converts md over to use the new badblocks 'library'. I have done some pretty simple testing on this - created a raid 1 device, made sure the sysfs entries show up, and can be used to add and view badblocks. A closer look by the md folks would be nice here.
Vishal Verma (3): badblocks: Add core badblock management code block: Add badblock management for gendisks md: convert to use the generic badblocks code block/Makefile| 2 +- block/badblocks.c | 576 ++ block/genhd.c | 76 ++ drivers/md/md.c | 516 ++--- drivers/md/md.h | 40 +--- include/linux/badblocks.h | 53 + include/linux/genhd.h | 7 + 7 files changed, 741 insertions(+), 529 deletions(-) create mode 100644 block/badblocks.c create mode 100644 include/linux/badblocks.h -- 2.5.0
[PATCH v4 3/3] md: convert to use the generic badblocks code
Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma --- drivers/md/md.c | 516 +++- drivers/md/md.h | 40 + 2 files changed, 28 insertions(+), 528 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 807095f..1e48aa9 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -34,6 +34,7 @@ #include #include +#include #include #include #include @@ -709,8 +710,7 @@ void md_rdev_clear(struct md_rdev *rdev) put_page(rdev->bb_page); rdev->bb_page = NULL; } - kfree(rdev->badblocks.page); - rdev->badblocks.page = NULL; + badblocks_free(&rdev->badblocks); } EXPORT_SYMBOL_GPL(md_rdev_clear); @@ -1360,8 +1360,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb) return cpu_to_le32(csum); } -static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors, - int acknowledged); static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version) { struct mdp_superblock_1 *sb; @@ -1486,8 +1484,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ count <<= sb->bblog_shift; if (bb + 1 == 0) break; - if (md_set_badblocks(&rdev->badblocks, -sector, count, 1) == 0) + if (badblocks_set(&rdev->badblocks, sector, count, 1)) return -EINVAL; } } else if (sb->bblog_offset != 0) @@ -2319,7 +2316,7 @@ repeat: rdev_for_each(rdev, mddev) { if (rdev->badblocks.changed) { rdev->badblocks.changed = 0; - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); md_error(mddev, rdev); } clear_bit(Blocked, &rdev->flags); @@ -2445,7 +2442,7 @@ repeat: clear_bit(Blocked, &rdev->flags); if (any_badblocks_changed) - md_ack_all_badblocks(&rdev->badblocks); + ack_all_badblocks(&rdev->badblocks); clear_bit(BlockedBadBlocks, &rdev->flags); wake_up(&rdev->blocked_wait); } @@ -3046,11 +3043,17 @@ static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_ static struct rdev_sysfs_entry 
rdev_recovery_start = __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store); -static ssize_t -badblocks_show(struct badblocks *bb, char *page, int unack); -static ssize_t -badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack); - +/* sysfs access to bad-blocks list. + * We present two files. + * 'bad-blocks' lists sector numbers and lengths of ranges that + *are recorded as bad. The list is truncated to fit within + *the one-page limit of sysfs. + *Writing "sector length" to this file adds an acknowledged + *bad block list. + * 'unacknowledged-bad-blocks' lists bad blocks that have not yet + *been acknowledged. Writing to this file adds bad blocks + *without acknowledging them. This is largely for testing. + */ static ssize_t bb_show(struct md_rdev *rdev, char *page) { return badblocks_show(&rdev->badblocks, page, 0); @@ -3165,14 +3168,7 @@ int md_rdev_init(struct md_rdev *rdev) * This reserves the space even on arrays where it cannot * be used - I wonder if that matters */ - rdev->badblocks.count = 0; - rdev->badblocks.shift = -1; /* disabled until explicitly enabled */ - rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL); - seqlock_init(&rdev->badblocks.lock); - if (rdev->badblocks.page == NULL) - return -ENOMEM; - - return 0; + return badblocks_init(&rdev->badblocks, 0); } EXPORT_SYMBOL_GPL(md_rdev_init); /* @@ -8478,254 +8474,9 @@ void md_finish_reshape(struct mddev *mddev) } EXPORT_SYMBOL(md_finish_reshape); -/* Bad block management. - * We can record which blocks on each device are 'bad' and so just - * fail those blocks, or that stripe, rather than the whole device. - * Entries in the bad-block table are 64bits wide. This comprises: - * Length of bad-range, in sectors: 0-511 for lengths 1-512 - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes) - * A 'shift' can be
[PATCH v4 2/3] block: Add badblock management for gendisks
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: Vishal Verma --- block/genhd.c | 76 +++ include/linux/genhd.h | 7 + 2 files changed, 83 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index e5cafa5..81dcf32 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "blk.h" @@ -505,6 +506,16 @@ static int exact_lock(dev_t devt, void *data) return 0; } +int disk_alloc_badblocks(struct gendisk *disk) +{ + disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL); + if (!disk->bb) + return -ENOMEM; + + return badblocks_init(disk->bb, 1); +} +EXPORT_SYMBOL(disk_alloc_badblocks); + static void register_disk(struct gendisk *disk) { struct device *ddev = disk_to_dev(disk); @@ -659,6 +670,11 @@ void del_gendisk(struct gendisk *disk) blk_unregister_queue(disk); blk_unregister_region(disk_devt(disk), disk->minors); + if (disk->bb) { + badblocks_free(disk->bb); + kfree(disk->bb); + } + part_stat_set_all(&disk->part0, 0); disk->part0.stamp = 0; @@ -672,6 +688,63 @@ void del_gendisk(struct gendisk *disk) } EXPORT_SYMBOL(del_gendisk); +/* + * The gendisk usage of badblocks does not track acknowledgements for + * badblocks. We always assume they are acknowledged. 
+ */ +int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors, + sector_t *first_bad, int *bad_sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors); +} +EXPORT_SYMBOL(disk_check_badblocks); + +int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_set(disk->bb, s, sectors, 1); +} +EXPORT_SYMBOL(disk_set_badblocks); + +int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors) +{ + if (!disk->bb) + return 0; + + return badblocks_clear(disk->bb, s, sectors); +} +EXPORT_SYMBOL(disk_clear_badblocks); + +/* sysfs access to bad-blocks list. */ +static ssize_t disk_badblocks_show(struct device *dev, + struct device_attribute *attr, + char *page) +{ + struct gendisk *disk = dev_to_disk(dev); + + if (!disk->bb) + return 0; + + return badblocks_show(disk->bb, page, 0); +} + +static ssize_t disk_badblocks_store(struct device *dev, + struct device_attribute *attr, + const char *page, size_t len) +{ + struct gendisk *disk = dev_to_disk(dev); + + if (!disk->bb) + return 0; + + return badblocks_store(disk->bb, page, len, 0); +} + /** * get_gendisk - get partitioning information for a given device * @devt: device to get partitioning information for @@ -990,6 +1063,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show, static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL); static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL); static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL); +static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show, + disk_badblocks_store); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store); @@ -1011,6 +1086,7 @@ static struct attribute *disk_attrs[] = { &dev_attr_capability.attr, &dev_attr_stat.attr, &dev_attr_inflight.attr, + 
&dev_attr_badblocks.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, #endif diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 847cc1d..0bbec68 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -162,6 +162,7 @@ struct disk_part_tbl { }; struct disk_events; +struct badblocks; #if defined(CONFIG_BLK_DEV_INTEGRITY) @@ -213,6 +214,7 @@ struct gendisk { struct kobject integrity_kobj; #endif /* CONFIG_BLK_DEV_INTEGRITY */ int node_id; + struct badblocks *b
[PATCH v4 1/3] badblocks: Add core badblock management code
Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data. The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions. Signed-off-by: Vishal Verma --- block/Makefile| 2 +- block/badblocks.c | 576 ++ include/linux/badblocks.h | 53 + 3 files changed, 630 insertions(+), 1 deletion(-) create mode 100644 block/badblocks.c create mode 100644 include/linux/badblocks.h diff --git a/block/Makefile b/block/Makefile index 00ecc97..db5f622 100644 --- a/block/Makefile +++ b/block/Makefile @@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \ blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ - partitions/ + badblocks.o partitions/ obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_BLK_DEV_BSG) += bsg.o diff --git a/block/badblocks.c b/block/badblocks.c new file mode 100644 index 000..f0ac279 --- /dev/null +++ b/block/badblocks.c @@ -0,0 +1,576 @@ +/* + * Bad block management + * + * - Heavily based on MD badblocks code from Neil Brown + * + * Copyright (c) 2015, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +/** + * badblocks_check() - check a given range for bad sectors + * @bb:the badblocks structure that holds all badblock information + * @s: sector (start) at which to check for badblocks + * @sectors: number of sectors to check for badblocks + * @first_bad: pointer to store location of the first badblock + * @bad_sectors: pointer to store number of badblocks after @first_bad + * + * We can record which blocks on each device are 'bad' and so just + * fail those blocks, or that stripe, rather than the whole device. + * Entries in the bad-block table are 64bits wide. This comprises: + * Length of bad-range, in sectors: 0-511 for lengths 1-512 + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes) + * A 'shift' can be set so that larger blocks are tracked and + * consequently larger devices can be covered. + * 'Acknowledged' flag - 1 bit. - the most significant bit. + * + * Locking of the bad-block table uses a seqlock so badblocks_check + * might need to retry if it is very unlucky. + * We will sometimes want to check for bad blocks in a bi_end_io function, + * so we use the write_seqlock_irq variant. + * + * When looking for a bad block we specify a range and want to + * know if any block in the range is bad. So we binary-search + * to the last range that starts at-or-before the given endpoint, + * (or "before the sector after the target range") + * then see if it ends after the given start. + * + * Return: + * 0: there are no known bad blocks in the range + * 1: there are known bad block which are all acknowledged + * -1: there are bad blocks which have not yet been acknowledged in metadata. + * plus the start/length of the first bad section we overlap. 
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+			sector_t *first_bad, int *bad_sectors)
+{
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+	/* 'target' is now the first block after the bad range */
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+	lo = 0;
+	rv = 0;
+	hi = bb->count;
+
+	/* Binary search between lo and hi for 'target'
+	 * i.e. for the last range that starts before 'target'
+	 */
+	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+	 * are known not to be the last range before target.
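The "round the start down, and the end up" step at the top of badblocks_check converts a sector range into units of 2^shift sectors so that every partially covered block is included. A small userspace sketch of that arithmetic follows; struct bb_span and bb_round_to_blocks are hypothetical names, and the end value is exclusive, like 'target' in the function above.

```c
#include <stdint.h>

struct bb_span {
	uint64_t start;	/* first block touched by the query */
	uint64_t end;	/* first block after the query (exclusive) */
};

/*
 * Convert the sector range [s, s + sectors) into blocks of
 * (1 << shift) sectors: the start is rounded down, the end up.
 */
struct bb_span bb_round_to_blocks(uint64_t s, int sectors, int shift)
{
	uint64_t target = s + sectors;	/* first sector after the range */
	struct bb_span r;

	r.start = s >> shift;
	r.end = (target + ((uint64_t)1 << shift) - 1) >> shift;
	return r;
}
```

For example, with shift 3 (8-sector blocks), sectors 10 to 14 map to block 1 only, while sectors 10 to 17 map to blocks 1 and 2, because sector 16 starts the next block.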
[PATCH v5 2/3] block: Add badblock management for gendisks
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: Vishal Verma --- block/genhd.c | 76 +++ include/linux/genhd.h | 7 + 2 files changed, 83 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index e5cafa5..81dcf32 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "blk.h" @@ -505,6 +506,16 @@ static int exact_lock(dev_t devt, void *data) return 0; } +int disk_alloc_badblocks(struct gendisk *disk) +{ + disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL); + if (!disk->bb) + return -ENOMEM; + + return badblocks_init(disk->bb, 1); +} +EXPORT_SYMBOL(disk_alloc_badblocks); + static void register_disk(struct gendisk *disk) { struct device *ddev = disk_to_dev(disk); @@ -659,6 +670,11 @@ void del_gendisk(struct gendisk *disk) blk_unregister_queue(disk); blk_unregister_region(disk_devt(disk), disk->minors); + if (disk->bb) { + badblocks_free(disk->bb); + kfree(disk->bb); + } + part_stat_set_all(&disk->part0, 0); disk->part0.stamp = 0; @@ -672,6 +688,63 @@ void del_gendisk(struct gendisk *disk) } EXPORT_SYMBOL(del_gendisk); +/* + * The gendisk usage of badblocks does not track acknowledgements for + * badblocks. We always assume they are acknowledged. 
+ */
+int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+			 sector_t *first_bad, int *bad_sectors)
+{
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
+}
+EXPORT_SYMBOL(disk_check_badblocks);
+
+int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_set(disk->bb, s, sectors, 1);
+}
+EXPORT_SYMBOL(disk_set_badblocks);
+
+int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_clear(disk->bb, s, sectors);
+}
+EXPORT_SYMBOL(disk_clear_badblocks);
+
+/* sysfs access to bad-blocks list. */
+static ssize_t disk_badblocks_show(struct device *dev,
+					struct device_attribute *attr,
+					char *page)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_show(disk->bb, page, 0);
+}
+
+static ssize_t disk_badblocks_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *page, size_t len)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (!disk->bb)
+		return 0;
+
+	return badblocks_store(disk->bb, page, len, 0);
+}
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
@@ -990,6 +1063,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show,
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
+		disk_badblocks_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1011,6 +1086,7 @@ static struct attribute *disk_attrs[] = {
 	&dev_attr_capability.attr,
 	&dev_attr_stat.attr,
 	&dev_attr_inflight.attr,
+	&dev_attr_badblocks.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d..0bbec68 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -162,6 +162,7 @@ struct disk_part_tbl {
 };
 
 struct disk_events;
+struct badblocks;
 
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 
@@ -213,6 +214,7 @@ struct gendisk {
 	struct kobject integrity_kobj;
 #endif	/* CONFIG_BLK_DEV_INTEGRITY */
 	int node_id;
+	struct badblocks *b
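As an illustration of how userspace might consume the 'badblocks' attribute
that this patch adds to disk_attrs, here is a rough sketch of a parser for
its contents. It assumes the file holds one "sector length" pair per line,
which is the format the md sysfs comment describes for 'bad-blocks'; the
path /sys/block/<disk>/badblocks is likewise an assumption, and
parse_badblocks() and struct bb_range are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace view of one reported bad range. */
struct bb_range {
	unsigned long long sector;	/* first bad sector */
	unsigned int len;		/* number of bad sectors */
};

/*
 * Parse text assumed to be in the "sector length\n" format.
 * Stores at most 'max' ranges in 'out' and returns how many were
 * stored; parsing stops at the first line that does not match.
 */
int parse_badblocks(const char *text, struct bb_range *out, int max)
{
	int n = 0;

	assert(text && out);
	while (n < max && *text) {
		int used = 0;

		if (sscanf(text, "%llu %u%n",
			   &out[n].sector, &out[n].len, &used) != 2)
			break;
		n++;
		text += used;
		/* skip the line terminator and any padding */
		while (*text == '\n' || *text == ' ')
			text++;
	}
	return n;
}
```

A caller would read the attribute into a buffer (for example with fread()
on the assumed path) and hand that buffer to parse_badblocks(). Because the
show functions return 0 when disk->bb is NULL, an empty buffer simply
yields zero ranges.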
[PATCH v5 3/3] md: convert to use the generic badblocks code
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks for all manipulation.

Signed-off-by: Vishal Verma
---
 drivers/md/md.c | 516 +++-
 drivers/md/md.h |  40 +
 2 files changed, 28 insertions(+), 528 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index dbedc58..51dc9f3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -34,6 +34,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -710,8 +711,7 @@ void md_rdev_clear(struct md_rdev *rdev)
 		put_page(rdev->bb_page);
 		rdev->bb_page = NULL;
 	}
-	kfree(rdev->badblocks.page);
-	rdev->badblocks.page = NULL;
+	badblocks_free(&rdev->badblocks);
 }
 EXPORT_SYMBOL_GPL(md_rdev_clear);
 
@@ -1361,8 +1361,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
 	return cpu_to_le32(csum);
 }
 
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-			    int acknowledged);
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version)
 {
 	struct mdp_superblock_1 *sb;
@@ -1487,8 +1485,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 			count <<= sb->bblog_shift;
 			if (bb + 1 == 0)
 				break;
-			if (md_set_badblocks(&rdev->badblocks,
-					     sector, count, 1) == 0)
+			if (badblocks_set(&rdev->badblocks, sector, count, 1))
 				return -EINVAL;
 		}
 	} else if (sb->bblog_offset != 0)
@@ -2320,7 +2317,7 @@ repeat:
 		rdev_for_each(rdev, mddev) {
 			if (rdev->badblocks.changed) {
 				rdev->badblocks.changed = 0;
-				md_ack_all_badblocks(&rdev->badblocks);
+				ack_all_badblocks(&rdev->badblocks);
 				md_error(mddev, rdev);
 			}
 			clear_bit(Blocked, &rdev->flags);
@@ -2446,7 +2443,7 @@ repeat:
 			clear_bit(Blocked, &rdev->flags);
 
 		if (any_badblocks_changed)
-			md_ack_all_badblocks(&rdev->badblocks);
+			ack_all_badblocks(&rdev->badblocks);
 		clear_bit(BlockedBadBlocks, &rdev->flags);
 		wake_up(&rdev->blocked_wait);
 	}
@@ -3054,11 +3051,17 @@ static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_
 static struct rdev_sysfs_entry rdev_recovery_start =
 __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
 
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack);
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
-
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ *    are recorded as bad.  The list is truncated to fit within
+ *    the one-page limit of sysfs.
+ *    Writing "sector length" to this file adds an acknowledged
+ *    bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ *    been acknowledged.  Writing to this file adds bad blocks
+ *    without acknowledging them.  This is largely for testing.
+ */
 static ssize_t bb_show(struct md_rdev *rdev, char *page)
 {
 	return badblocks_show(&rdev->badblocks, page, 0);
@@ -3173,14 +3176,7 @@ int md_rdev_init(struct md_rdev *rdev)
 	 * This reserves the space even on arrays where it cannot
 	 * be used - I wonder if that matters
 	 */
-	rdev->badblocks.count = 0;
-	rdev->badblocks.shift = -1; /* disabled until explicitly enabled */
-	rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	seqlock_init(&rdev->badblocks.lock);
-	if (rdev->badblocks.page == NULL)
-		return -ENOMEM;
-
-	return 0;
+	return badblocks_init(&rdev->badblocks, 0);
 }
 EXPORT_SYMBOL_GPL(md_rdev_init);
 /*
@@ -8486,254 +8482,9 @@ void md_finish_reshape(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_finish_reshape);
 
-/* Bad block management.
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- * A 'shift' can be
[PATCH v5 1/3] badblocks: Add core badblock management code
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.

Signed-off-by: Vishal Verma
---
 block/Makefile            |   2 +-
 block/badblocks.c         | 561 ++
 include/linux/badblocks.h |  53 +
 3 files changed, 615 insertions(+), 1 deletion(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

diff --git a/block/Makefile b/block/Makefile
index 00ecc97..db5f622 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
 			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
-			partitions/
+			badblocks.o partitions/
 
 obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
diff --git a/block/badblocks.c b/block/badblocks.c
new file mode 100644
index 000..96aeb91
--- /dev/null
+++ b/block/badblocks.c
@@ -0,0 +1,561 @@
+/*
+ * Bad block management
+ *
+ * - Heavily based on MD badblocks code from Neil Brown
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/**
+ * badblocks_check() - check a given range for bad sectors
+ * @bb:		the badblocks structure that holds all badblock information
+ * @s:		sector (start) at which to check for badblocks
+ * @sectors:	number of sectors to check for badblocks
+ * @first_bad:	pointer to store location of the first badblock
+ * @bad_sectors: pointer to store number of badblocks after @first_bad
+ *
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ *
+ * Return:
+ *  0: there are no known bad blocks in the range
+ *  1: there are known bad block which are all acknowledged
+ * -1: there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+		    sector_t *first_bad, int *bad_sectors)
+{
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+	/* 'target' is now the first block after the bad range */
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+	lo = 0;
+	rv = 0;
+	hi = bb->count;
+
+	/* Binary search between lo and hi for 'target'
+	 * i.e. for the last range that starts before 'target'
+	 */
+	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+	 * are known not to be the last range before target.
+
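The search strategy described in the badblocks_check() kernel-doc comment
(binary-search to the last range that starts before the end of the query,
then walk backwards while ranges still end after the start) can be sketched
in userspace roughly as follows. This is a simplified illustration, not the
code from the patch: it uses a plain sorted array of hypothetical
struct bad_range records instead of packed 64-bit entries, and it leaves out
the 'shift' rounding and the seqlock retry loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unpacked table entry; the table is sorted by 'start'. */
struct bad_range {
	uint64_t start;	/* first bad sector */
	uint32_t len;	/* number of bad sectors */
	int ack;	/* non-zero if acknowledged */
};

/*
 * Returns 0 if [s, s + sectors) overlaps no known bad range, 1 if every
 * overlapping range is acknowledged, -1 if any is unacknowledged.
 * On overlap, *first_bad / *bad_sectors describe the lowest overlapping
 * range.
 */
int range_check(const struct bad_range *tbl, int count, uint64_t s,
		int sectors, uint64_t *first_bad, int *bad_sectors)
{
	uint64_t target = s + (uint64_t)sectors; /* first sector past the query */
	int lo = 0;
	int hi = count;
	int rv = 0;

	assert(sectors > 0);

	/* Narrow to the last range that starts before 'target'. */
	while (hi - lo > 1) {
		int mid = lo + (hi - lo) / 2;

		if (tbl[mid].start < target)
			lo = mid;
		else
			hi = mid;
	}

	if (hi > lo) {
		/* Walk backwards over every range that ends after 's'. */
		while (lo >= 0 && tbl[lo].start + tbl[lo].len > s) {
			if (tbl[lo].start < target) {
				/* starts before the end, ends after the start */
				if (rv != -1 && tbl[lo].ack)
					rv = 1;
				else
					rv = -1;
				*first_bad = tbl[lo].start;
				*bad_sectors = (int)tbl[lo].len;
			}
			lo--;
		}
	}
	return rv;
}
```

With a table of {10,5,acked}, {100,8,unacked}, {200,4,acked}, a query for
sectors 12-111 overlaps the first two ranges and returns -1, reporting the
range starting at sector 10 as the first bad section.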
[PATCH v5 0/3] Badblock tracking for gendisks
v5:
  - Rebase to v4.4-rc6
  - Revert back to using kzalloc from __get_free_page based on the
    discussion at: http://thread.gmane.org/gmane.linux.kernel/2113292

v4:
  - Rebase to v4.4-rc4

v3:
  - Add kernel-doc style comments to all exported functions in
    badblocks.c (James)
  - Make return values from badblocks functions consistent with themselves
    and the kernel style. Change the polarity of badblocks_set, and update
    all callers accordingly (James)
  - In gendisk, don't unconditionally allocate badblocks, export the
    initializer. This also allows the initializer to be a non-void return
    type, so that the badblocks user can act upon failures better (James)

v2:
  - In badblocks_free, make 'page' NULL (patch 1)
  - Move the core badblocks code to a new .c file (patch 1) (Jens)
  - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
  - Since disk_alloc_badblocks can fail, check disk->bb for NULL in the
    genhd wrappers (patch 2) (Jeff)
  - Update the md conversion to also use the badblocks init and free
    functions (patch 3)
  - Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies badblock management code into a .c file and header of its
own (block/badblocks.c and include/linux/badblocks.h), making it generally
available. It follows common libraries of code such as linked lists, where
anyone may embed a core data structure in another place, and use the
provided accessor functions to manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use by
NVDIMM devices).

Patch 3 converts md over to use the new badblocks 'library'. I have done
some pretty simple testing on this - created a raid 1 device, made sure
the sysfs entries show up, and can be used to add and view badblocks.
A closer look by the md folks would be nice here.
Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile            |   2 +-
 block/badblocks.c         | 561 ++
 block/genhd.c             |  76 +++
 drivers/md/md.c           | 516 +++---
 drivers/md/md.h           |  40 +---
 include/linux/badblocks.h |  53 +
 include/linux/genhd.h     |   7 +
 7 files changed, 726 insertions(+), 529 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0
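To illustrate the "embed a core structure, manipulate it through accessor
functions" approach the cover letter describes, here is a small userspace
sketch. All names (struct range_table, struct member_dev, struct disk_like
and their helpers) are hypothetical stand-ins: member_dev embeds the table
by value the way md keeps badblocks inside rdev, while disk_like holds a
pointer that is allocated on demand, as gendisk does with disk->bb.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared bad-range table. */
struct range_table {
	int count;
	uint64_t *page;	/* storage for the packed entries */
};

/* Accessor: set up an empty table; returns 0 on success, -1 on failure. */
int range_table_init(struct range_table *t)
{
	assert(t);
	t->count = 0;
	t->page = calloc(512, sizeof(*t->page));
	return t->page ? 0 : -1;
}

/* Accessor: release the storage and clear the pointer. */
void range_table_free(struct range_table *t)
{
	assert(t);
	free(t->page);
	t->page = NULL;
}

/* Owner that embeds the table by value (md-like). */
struct member_dev {
	int id;
	struct range_table badblocks;
};

/* Owner that allocates the table only when asked (gendisk-like). */
struct disk_like {
	struct range_table *bb;
};

int disk_like_alloc(struct disk_like *d)
{
	d->bb = calloc(1, sizeof(*d->bb));
	if (!d->bb)
		return -1;
	if (range_table_init(d->bb)) {
		free(d->bb);
		d->bb = NULL;
		return -1;
	}
	return 0;
}

void disk_like_release(struct disk_like *d)
{
	if (d->bb) {
		range_table_free(d->bb);
		free(d->bb);
		d->bb = NULL;
	}
}
```

Both owners go through the same init and free accessors; only where the
structure lives differs, which is why the pointer-holding owner has to
check for NULL before every use.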