[RFC PATCH] block: introduce poison tracking for block devices

2015-11-19 Thread Vishal Verma
This patch copies the badblock management code from md-raid and uses it
to track bad ('poison') sectors at the per-block-device level.

NVDIMM devices, which behave more like DRAM, may develop bad cache
lines, or 'poison'. A block device exposed by the pmem driver can
then consume poison via a read (or write), and cause a machine check.
On platforms without machine check recovery features, this would
mean a crash.

The block device maintaining a runtime list of all known poison can
directly avoid this, and also provide a path forward to enable proper
handling/recovery for DAX faults on such a device.

Signed-off-by: Vishal Verma 
---

This really is a copy-paste + a few modifications of the badblock management
code + sysfs representation from md.

In this RFC, I want to make sure this path sounds acceptable for the use case
described above, for NVDIMMs. Eventually, the md badblock management and this
code should be refactored to share one implementation, which I think should be
easy to do:
- move the badblocks struct and associated functions into a header file (along
  the lines of include/linux/list.h)
- embed the structure into whatever needs to use this list (in case of md, this
  would be 'rdev', in the nvdimm case, the gendisk)
- call the functions from badblocks.h as needed to manipulate the list.
- The sysfs show/store functions in badblocks.h would be generic variants, with
  wrappers being present in md and gendisk to fit into their respective sysfs
  layouts
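
As a rough user-space illustration of the embedding idea in the list above
(not part of the patch): every name below (bb_list, fake_rdev, fake_gendisk,
bb_list_init, bb_list_add, bb_list_free) is a hypothetical stand-in, and the
sketch leaves out locking, sorting and range merging.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical generic list, standing in for the shared badblocks struct. */
struct bb_list {
	int count;          /* number of entries in use */
	int max;            /* capacity of 'entries' */
	uint64_t *entries;  /* bad start sectors, in insertion order */
};

/* Two unrelated owners embed the same structure. */
struct fake_rdev    { int raid_disk;   struct bb_list badblocks; };
struct fake_gendisk { int first_minor; struct bb_list bb; };

/* Generic helpers only ever see the embedded structure. */
int bb_list_init(struct bb_list *bb, int max)
{
	bb->count = 0;
	bb->max = max;
	bb->entries = malloc(sizeof(*bb->entries) * max);
	return bb->entries ? 0 : -1;
}

int bb_list_add(struct bb_list *bb, uint64_t sector)
{
	if (bb->count >= bb->max)
		return -1;	/* table full */
	bb->entries[bb->count++] = sector;
	return 0;
}

void bb_list_free(struct bb_list *bb)
{
	free(bb->entries);
	bb->entries = NULL;
	bb->count = 0;
}
```

An md-like caller would pass &rdev->badblocks and a gendisk-like caller
&disk->bb to the same helpers, which is the property the refactoring is after.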

If this looks generally reasonable, I'll post a v2 with this refactoring done.

 block/genhd.c | 502 ++
 include/linux/genhd.h |  26 +++
 2 files changed, 528 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..de99d28 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -23,6 +23,15 @@
 
 #include "blk.h"
 
+#define BB_LEN_MASK    (0x00000000000001FFULL)
+#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK    (0x8000000000000000ULL)
+#define BB_MAX_LEN 512
+#define BB_OFFSET(x)   (((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x)  (((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x)  (!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
 static DEFINE_MUTEX(block_class_lock);
 struct kobject *block_depr;
 
@@ -670,6 +679,496 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+int disk_poison_list_init(struct gendisk *disk)
+{
+   disk->plist = kmalloc(sizeof(*disk->plist), GFP_KERNEL);
+   if (!disk->plist)
+   return -ENOMEM;
+   disk->plist->count = 0;
+   disk->plist->shift = 0;
+   disk->plist->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+   seqlock_init(&disk->plist->lock);
+   if (disk->plist->page == NULL)
+   return -ENOMEM;
+
+   return 0;
+}
+EXPORT_SYMBOL(disk_poison_list_init);
+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so md_is_badblock
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ *  0 if there are no known bad blocks in the range
+ *  1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int disk_check_poison(struct gendisk *disk, sector_t s, int sectors,
+  sector_t *first_bad, int *bad_sectors)
+{
+   struct disk_poison *bb = disk->plist;
+   int hi;
+   int lo;
+   u64 *p = bb->page;
+   int rv;
+   sector_t target = s + sectors;
+   unsigned seq;
+
+   if (bb->shift > 0) {
+   /* round the start down, and the end up */
+   s >>= bb->shift;
+   target += (1<<bb->shift) - 1;
+   target >>= bb->shift;
+   sect

[PATCH 2/3] block: Add badblock management for gendisks

2015-11-20 Thread Vishal Verma
NVDIMM devices, which can behave more like DRAM rather than block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.

The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.

Use the new badblock management interfaces to add a badblocks list to
gendisks.
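
To illustrate how a driver's read path might consult such a list before
touching the media, here is a small user-space sketch. It is not the pmem
driver's code: poison_range, is_poisoned and guarded_read are hypothetical
names, and a linear scan stands in for the lookup that
disk_check_badblocks() provides in this patch.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

struct poison_range { uint64_t start; uint64_t len; };

/* Return 1 if [s, s + sectors) overlaps any known-bad range. */
static int is_poisoned(const struct poison_range *bad, int count,
		       uint64_t s, uint64_t sectors)
{
	for (int i = 0; i < count; i++)
		if (bad[i].start < s + sectors && s < bad[i].start + bad[i].len)
			return 1;
	return 0;
}

/* Hypothetical read path: fail the I/O rather than touch poisoned memory. */
int guarded_read(const unsigned char *media,
		 const struct poison_range *bad, int count,
		 uint64_t s, uint64_t sectors, unsigned char *dst)
{
	if (is_poisoned(bad, count, s, sectors))
		return -EIO;
	memcpy(dst, media + s * SECTOR_SIZE, sectors * SECTOR_SIZE);
	return 0;
}
```

The point of the sketch is only the ordering: the list is checked first, so
the copy that could trigger a machine check never runs for a known-bad range.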

Signed-off-by: Vishal Verma 
---
 block/genhd.c | 64 +++
 include/linux/genhd.h |  6 +
 2 files changed, 70 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..4209c32 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "blk.h"
 
@@ -505,6 +506,20 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
+static void disk_alloc_badblocks(struct gendisk *disk)
+{
+   disk->bb = kzalloc(sizeof(disk->bb), GFP_KERNEL);
+   if (!disk->bb) {
+   pr_warn("%s: failed to allocate space for badblocks\n",
+   disk->disk_name);
+   return;
+   }
+
+   if (badblocks_init(disk->bb, 1))
+   pr_warn("%s: failed to initialize badblocks\n",
+   disk->disk_name);
+}
+
 static void register_disk(struct gendisk *disk)
 {
struct device *ddev = disk_to_dev(disk);
@@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk)
disk->first_minor = MINOR(devt);
 
disk_alloc_events(disk);
+   disk_alloc_badblocks(disk);
 
/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
@@ -657,6 +673,9 @@ void del_gendisk(struct gendisk *disk)
blk_unregister_queue(disk);
blk_unregister_region(disk_devt(disk), disk->minors);
 
+   badblocks_free(disk->bb);
+   kfree(disk->bb);
+
part_stat_set_all(&disk->part0, 0);
disk->part0.stamp = 0;
 
@@ -670,6 +689,48 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+/*
+ * The gendisk usage of badblocks does not track acknowledgements for
+ * badblocks. We always assume they are acknowledged.
+ */
+int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+  sector_t *first_bad, int *bad_sectors)
+{
+   return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
+}
+EXPORT_SYMBOL(disk_check_badblocks);
+
+int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   return badblocks_set(disk->bb, s, sectors, 1);
+}
+EXPORT_SYMBOL(disk_set_badblocks);
+
+int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   return badblocks_clear(disk->bb, s, sectors);
+}
+EXPORT_SYMBOL(disk_clear_badblocks);
+
+/* sysfs access to bad-blocks list. */
+static ssize_t disk_badblocks_show(struct device *dev,
+   struct device_attribute *attr,
+   char *page)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   return badblocks_show(disk->bb, page, 0);
+}
+
+static ssize_t disk_badblocks_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *page, size_t len)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   return badblocks_store(disk->bb, page, len, 0);
+}
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
@@ -988,6 +1049,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, 
disk_discard_alignment_show,
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
+   disk_badblocks_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1009,6 +1072,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_capability.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+   &dev_attr_badblocks.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
 #endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 2adbfa6..5563bde 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -162,6 +162,7 @@ struct disk_part_tbl {
 };
 
 struct disk_events;
+struct badblocks;
 
 struct gendisk {

[PATCH 0/3] Badblock tracking for gendisks

2015-11-20 Thread Vishal Verma
Patch 1 copies badblock management code into a header of its own,
making it generally available. It follows the pattern of common kernel
library code such as linked lists, where anyone may embed the core data
structure in their own structure and use the provided accessor
functions to manipulate it.

Patch 2 adds badblock tracking to gendisks (in preparation for use
by NVDIMM devices). Right now, it is turned on unconditionally - I'd
appreciate comments on whether that is the right path.

Patch 3 converts md over to use the new badblocks 'library'. I have
done some pretty simple testing on this - created a raid 1 device,
made sure the sysfs entries show up, and can be used to add and view
badblocks. A closer look by the md folks would be nice here.
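
For reference, the md comment describes writes to the sysfs file as
"sector length". A user-space sketch of how such a line could be parsed is
below; parse_badblock_line is a hypothetical helper written for illustration,
not a function from this series.

```c
#include <stdio.h>

/*
 * Parse one "sector length" line, with or without a trailing newline.
 * Returns 0 on success and -1 on malformed input.
 */
int parse_badblock_line(const char *buf, unsigned long long *sector,
			int *length)
{
	char newline;

	switch (sscanf(buf, "%llu %d%c", sector, length, &newline)) {
	case 3:
		/* Anything other than a newline after the length is junk. */
		if (newline != '\n')
			return -1;
		/* fall through */
	case 2:
		if (*length <= 0)
			return -1;
		return 0;
	default:
		return -1;
	}
}
```

With a parser of this shape, writing "2048 8" would mark eight sectors
starting at sector 2048 as bad.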

Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/genhd.c |  64 ++
 drivers/md/md.c   | 495 ++--
 drivers/md/md.h   |  31 +--
 include/linux/badblocks.h | 512 ++
 include/linux/genhd.h |   6 +
 5 files changed, 603 insertions(+), 505 deletions(-)
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] md: convert to use the generic badblocks code

2015-11-20 Thread Vishal Verma
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks.h for all manipulation.

Signed-off-by: Vishal Verma 
---
 drivers/md/md.c | 495 +++-
 drivers/md/md.h |  31 +---
 2 files changed, 21 insertions(+), 505 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c702de1..82994d7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -34,6 +34,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1358,8 +1359,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
return cpu_to_le32(csum);
 }
 
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-   int acknowledged);
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int 
minor_version)
 {
struct mdp_superblock_1 *sb;
@@ -1484,7 +1483,7 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
count <<= sb->bblog_shift;
if (bb + 1 == 0)
break;
-   if (md_set_badblocks(&rdev->badblocks,
+   if (badblocks_set(&rdev->badblocks,
 sector, count, 1) == 0)
return -EINVAL;
}
@@ -2226,7 +2225,7 @@ repeat:
rdev_for_each(rdev, mddev) {
if (rdev->badblocks.changed) {
rdev->badblocks.changed = 0;
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
md_error(mddev, rdev);
}
clear_bit(Blocked, &rdev->flags);
@@ -2352,7 +2351,7 @@ repeat:
clear_bit(Blocked, &rdev->flags);
 
if (any_badblocks_changed)
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
clear_bit(BlockedBadBlocks, &rdev->flags);
wake_up(&rdev->blocked_wait);
}
@@ -2944,11 +2943,17 @@ static ssize_t recovery_start_store(struct md_rdev 
*rdev, const char *buf, size_
 static struct rdev_sysfs_entry rdev_recovery_start =
 __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, 
recovery_start_store);
 
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack);
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
-
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ *are recorded as bad.  The list is truncated to fit within
+ *the one-page limit of sysfs.
+ *Writing "sector length" to this file adds an acknowledged
+ *bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ *been acknowledged.  Writing to this file adds bad blocks
+ *without acknowledging them.  This is largely for testing.
+ */
 static ssize_t bb_show(struct md_rdev *rdev, char *page)
 {
return badblocks_show(&rdev->badblocks, page, 0);
@@ -8348,253 +8353,7 @@ void md_finish_reshape(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_finish_reshape);
 
-/* Bad block management.
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- *  A 'shift' can be set so that larger blocks are tracked and
- *  consequently larger devices can be covered.
- * 'Acknowledged' flag - 1 bit. - the most significant bit.
- *
- * Locking of the bad-block table uses a seqlock so md_is_badblock
- * might need to retry if it is very unlucky.
- * We will sometimes want to check for bad blocks in a bi_end_io function,
- * so we use the write_seqlock_irq variant.
- *
- * When looking for a bad block we specify a range and want to
- * know if any block in the range is bad.  So we binary-search
- * to the last range that starts at-or-before the given endpoint,
- * (or "before the sector after the target range")
- * then see if it ends after the given start.
- * We return
- *  0 if there are no known bad blocks in the range
- *  1 if there are known bad block which are all acknowledged
- * -1 if there are bad blocks which have not yet been acknowledged in metadata.
- * plus the start/length of the first bad section we overlap.
- */
-int md_is_badblo

[PATCH 1/3] badblocks: Add core badblock management code

2015-11-20 Thread Vishal Verma
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.
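
As a quick illustration of the 64-bit entry layout that the comment in this
patch describes (9 bits of length, 54 bits of start sector, 1 acknowledged
bit), here is a user-space sketch. The masks are written out in full to match
that layout, uint64_t stands in for u64, and bb_pack is a hypothetical helper
added only for the example.

```c
#include <stdint.h>

#define BB_LEN_MASK     (0x00000000000001FFULL)
#define BB_OFFSET_MASK  (0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK     (0x8000000000000000ULL)
#define BB_OFFSET(x)    (((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)       (((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)       (!!((x) & BB_ACK_MASK))
#define BB_MAKE(a, l, ack) \
	(((a) << 9) | ((l) - 1) | ((uint64_t)(!!(ack)) << 63))

/*
 * Pack one bad range into a table entry.
 * 'len' must be 1..512 and 'start' must fit in 54 bits.
 */
uint64_t bb_pack(uint64_t start, int len, int ack)
{
	return BB_MAKE(start, (uint64_t)len, ack);
}
```

Because the length is stored as len - 1, a single entry covers at most 512
sectors; longer bad ranges have to be split across several entries.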

Signed-off-by: Vishal Verma 
---
 include/linux/badblocks.h | 512 ++
 1 file changed, 512 insertions(+)
 create mode 100644 include/linux/badblocks.h

diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
new file mode 100644
index 0000000..94fa348
--- /dev/null
+++ b/include/linux/badblocks.h
@@ -0,0 +1,512 @@
+#ifndef _LINUX_BADBLOCKS_H
+#define _LINUX_BADBLOCKS_H
+
+#include 
+#include 
+#include 
+
+#define BB_LEN_MASK    (0x00000000000001FFULL)
+#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK    (0x8000000000000000ULL)
+#define BB_MAX_LEN 512
+#define BB_OFFSET(x)   (((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x)  (((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x)  (!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
+/* Bad block numbers are stored sorted in a single page.
+ * 64bits is used for each block or extent.
+ * 54 bits are sector number, 9 bits are extent size,
+ * 1 bit is an 'acknowledged' flag.
+ */
+#define MAX_BADBLOCKS  (PAGE_SIZE/8)
+
+struct badblocks {
+   int count;  /* count of bad blocks */
+   int unacked_exist;  /* there probably are unacknowledged
+* bad blocks.  This is only cleared
+* when a read discovers none
+*/
+   int shift;  /* shift from sectors to block size
+* a -ve shift means badblocks are
+* disabled.*/
+   u64 *page;  /* badblock list */
+   int changed;
+   seqlock_t lock;
+   sector_t sector;
+   sector_t size;  /* in sectors */
+};
+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ *  0 if there are no known bad blocks in the range
+ *  1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+static inline int badblocks_check(struct badblocks *bb, sector_t s, int 
sectors,
+  sector_t *first_bad, int *bad_sectors)
+{
+   int hi;
+   int lo;
+   u64 *p = bb->page;
+   int rv;
+   sector_t target = s + sectors;
+   unsigned seq;
+
+   if (bb->shift > 0) {
+   /* round the start down, and the end up */
+   s >>= bb->shift;
+   target += (1<<bb->shift) - 1;
+   target >>= bb->shift;
+   sectors = target - s;
+   }
+   /* 'target' is now the first block after the bad range */
+
+retry:
+   seq = read_seqbegin(&bb->lock);
+   lo = 0;
+   rv = 0;
+   hi = bb->count;
+
+   /* Binary search between lo and hi for 'target'
+* i.e. for the last range that starts before 'target'
+*/
+   /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+* are known not to be the last range before target.
+* VARIANT: hi-lo is the number of possible
+* ranges, and decreases until it reaches 1
+*/
+   while (hi - lo > 1) {
+   int mid = (lo + hi) / 2;
+   sector_t a = BB_OFFSET(p[mid]);
+   if (a < target)
+   /* This could still

[PATCH v2 2/3] block: Add badblock management for gendisks

2015-11-25 Thread Vishal Verma
NVDIMM devices, which can behave more like DRAM rather than block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.

The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.

Use the new badblock management interfaces to add a badblocks list to
gendisks.

Signed-off-by: Vishal Verma 
---
 block/genhd.c | 81 +++
 include/linux/genhd.h |  6 
 2 files changed, 87 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..84fd65c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "blk.h"
 
@@ -505,6 +506,20 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
+static void disk_alloc_badblocks(struct gendisk *disk)
+{
+   disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL);
+   if (!disk->bb) {
+   pr_warn("%s: failed to allocate space for badblocks\n",
+   disk->disk_name);
+   return;
+   }
+
+   if (badblocks_init(disk->bb, 1))
+   pr_warn("%s: failed to initialize badblocks\n",
+   disk->disk_name);
+}
+
 static void register_disk(struct gendisk *disk)
 {
struct device *ddev = disk_to_dev(disk);
@@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk)
disk->first_minor = MINOR(devt);
 
disk_alloc_events(disk);
+   disk_alloc_badblocks(disk);
 
/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
@@ -657,6 +673,11 @@ void del_gendisk(struct gendisk *disk)
blk_unregister_queue(disk);
blk_unregister_region(disk_devt(disk), disk->minors);
 
+   if (disk->bb) {
+   badblocks_free(disk->bb);
+   kfree(disk->bb);
+   }
+
part_stat_set_all(&disk->part0, 0);
disk->part0.stamp = 0;
 
@@ -670,6 +691,63 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+/*
+ * The gendisk usage of badblocks does not track acknowledgements for
+ * badblocks. We always assume they are acknowledged.
+ */
+int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+  sector_t *first_bad, int *bad_sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
+}
+EXPORT_SYMBOL(disk_check_badblocks);
+
+int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_set(disk->bb, s, sectors, 1);
+}
+EXPORT_SYMBOL(disk_set_badblocks);
+
+int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_clear(disk->bb, s, sectors);
+}
+EXPORT_SYMBOL(disk_clear_badblocks);
+
+/* sysfs access to bad-blocks list. */
+static ssize_t disk_badblocks_show(struct device *dev,
+   struct device_attribute *attr,
+   char *page)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_show(disk->bb, page, 0);
+}
+
+static ssize_t disk_badblocks_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *page, size_t len)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_store(disk->bb, page, len, 0);
+}
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
@@ -988,6 +1066,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, 
disk_discard_alignment_show,
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
+   disk_badblocks_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1009,6 +1089,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_capability.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+   &dev_attr_badblocks.attr,
 #i

[PATCH v2 0/3] Badblock tracking for gendisks

2015-11-25 Thread Vishal Verma
v2:
  - In badblocks_free, make 'page' NULL (patch 1)
  - Move the core badblocks code to a new .c file (patch 1) (Jens)
  - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
  - Since disk_alloc_badblocks can fail, check disk->bb for NULL in the
genhd wrappers (patch 2) (Jeff)
  - Update the md conversion to also use the badblocks init and free
functions (patch 3)
  - Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)
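
Two of the v2 items above (the sizeof fix and the NULL checks in the genhd
wrappers) can be illustrated with a small user-space sketch. The struct and
function names here are hypothetical and only mirror the shape of the change.

```c
#include <stddef.h>

struct bb_state {
	int count;
	int shift;
	unsigned long long *page;
	long pad[6];
};

struct disk { struct bb_state *bb; };

/* v1 sized the allocation by the pointer, v2 by the pointed-to struct. */
size_t v1_alloc_size(const struct disk *d) { return sizeof(d->bb); }
size_t v2_alloc_size(const struct disk *d) { return sizeof(*(d->bb)); }

/*
 * Wrapper in the v2 style: if the allocation failed earlier and 'bb' is
 * NULL, report "no bad blocks" instead of dereferencing it.
 */
int disk_check(const struct disk *d)
{
	if (!d->bb)
		return 0;
	return d->bb->count > 0;
}
```

sizeof(d->bb) is the size of a pointer, so the v1 allocation was too small
for the structure it was meant to hold.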

Patch 1 copies badblock management code into files of its own
(block/badblocks.c, with declarations in include/linux/badblocks.h),
making it generally available. It follows the pattern of common kernel
library code such as linked lists, where anyone may embed the core data
structure in their own structure and use the provided accessor
functions to manipulate it.

Patch 2 adds badblock tracking to gendisks (in preparation for use
by NVDIMM devices). Right now, it is turned on unconditionally - I'd
appreciate comments on whether that is the right path.

Patch 3 converts md over to use the new badblocks 'library'. I have
done some pretty simple testing on this - created a raid 1 device,
made sure the sysfs entries show up, and can be used to add and view
badblocks. A closer look by the md folks would be nice here.


Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile|   2 +-
 block/badblocks.c | 523 ++
 block/genhd.c |  81 +++
 drivers/md/md.c   | 507 ++--
 drivers/md/md.h   |  40 +---
 include/linux/badblocks.h |  53 +
 include/linux/genhd.h |   6 +
 7 files changed, 687 insertions(+), 525 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0


[PATCH v2 3/3] md: convert to use the generic badblocks code

2015-11-25 Thread Vishal Verma
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks.h for all manipulation.
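
The diff below replaces md's open-coded setup (shift = -1, "disabled until
explicitly enabled") with badblocks_init(&rdev->badblocks, 0), while the
gendisk patch passes 1. A user-space sketch of what an init/free pair with
such an enable flag could look like follows. It is an assumption based on
those call sites, not the code from patch 1; bb_sketch, bb_sketch_init and
bb_sketch_free are hypothetical names.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

struct bb_sketch {
	int count;
	int shift;      /* negative: tracking disabled */
	uint64_t *page; /* one page worth of 64-bit entries */
};

/* 'enable' selects between an active table and a disabled one. */
int bb_sketch_init(struct bb_sketch *bb, int enable)
{
	bb->count = 0;
	bb->shift = enable ? 0 : -1;
	bb->page = malloc(SKETCH_PAGE_SIZE);
	return bb->page ? 0 : -ENOMEM;
}

void bb_sketch_free(struct bb_sketch *bb)
{
	free(bb->page);
	bb->page = NULL;	/* a repeated free is then harmless */
}
```

Clearing the pointer in the free function matches the first v2 changelog item
and lets md_rdev_clear() call it without a separate NULL assignment.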

Signed-off-by: Vishal Verma 
---
 drivers/md/md.c | 507 +++-
 drivers/md/md.h |  40 +
 2 files changed, 23 insertions(+), 524 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c702de1..63eab20 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -34,6 +34,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -707,8 +708,7 @@ void md_rdev_clear(struct md_rdev *rdev)
put_page(rdev->bb_page);
rdev->bb_page = NULL;
}
-   kfree(rdev->badblocks.page);
-   rdev->badblocks.page = NULL;
+   badblocks_free(&rdev->badblocks);
 }
 EXPORT_SYMBOL_GPL(md_rdev_clear);
 
@@ -1358,8 +1358,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
return cpu_to_le32(csum);
 }
 
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-   int acknowledged);
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int 
minor_version)
 {
struct mdp_superblock_1 *sb;
@@ -1484,7 +1482,7 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
count <<= sb->bblog_shift;
if (bb + 1 == 0)
break;
-   if (md_set_badblocks(&rdev->badblocks,
+   if (badblocks_set(&rdev->badblocks,
 sector, count, 1) == 0)
return -EINVAL;
}
@@ -2226,7 +2224,7 @@ repeat:
rdev_for_each(rdev, mddev) {
if (rdev->badblocks.changed) {
rdev->badblocks.changed = 0;
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
md_error(mddev, rdev);
}
clear_bit(Blocked, &rdev->flags);
@@ -2352,7 +2350,7 @@ repeat:
clear_bit(Blocked, &rdev->flags);
 
if (any_badblocks_changed)
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
clear_bit(BlockedBadBlocks, &rdev->flags);
wake_up(&rdev->blocked_wait);
}
@@ -2944,11 +2942,17 @@ static ssize_t recovery_start_store(struct md_rdev 
*rdev, const char *buf, size_
 static struct rdev_sysfs_entry rdev_recovery_start =
 __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, 
recovery_start_store);
 
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack);
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
-
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ *are recorded as bad.  The list is truncated to fit within
+ *the one-page limit of sysfs.
+ *Writing "sector length" to this file adds an acknowledged
+ *bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ *been acknowledged.  Writing to this file adds bad blocks
+ *without acknowledging them.  This is largely for testing.
+ */
 static ssize_t bb_show(struct md_rdev *rdev, char *page)
 {
return badblocks_show(&rdev->badblocks, page, 0);
@@ -3063,14 +3067,7 @@ int md_rdev_init(struct md_rdev *rdev)
 * This reserves the space even on arrays where it cannot
 * be used - I wonder if that matters
 */
-   rdev->badblocks.count = 0;
-   rdev->badblocks.shift = -1; /* disabled until explicitly enabled */
-   rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   seqlock_init(&rdev->badblocks.lock);
-   if (rdev->badblocks.page == NULL)
-   return -ENOMEM;
-
-   return 0;
+   return badblocks_init(&rdev->badblocks, 0);
 }
 EXPORT_SYMBOL_GPL(md_rdev_init);
 /*
@@ -8348,253 +8345,7 @@ void md_finish_reshape(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_finish_reshape);
 
-/* Bad block management.
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- *  A 'shift' can be set so that larger blocks are tracked and
- *  consequently larger

[PATCH v2 1/3] badblocks: Add core badblock management code

2015-11-25 Thread Vishal Verma
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.
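
The range lookup described in the comment in this patch (binary-search to the
last range that starts before the end of the query, then walk backwards) can
be sketched in user space as follows. This is a simplified illustration: it
uses a plain struct per entry instead of packed 64-bit words, and omits the
seqlock retry and the shift handling. bad_range and range_check are
hypothetical names.

```c
#include <stdint.h>

struct bad_range { uint64_t start; int len; int acked; };

/*
 * 'tbl' is sorted by start and its ranges do not overlap.
 * Returns 0 (nothing bad), 1 (bad, all acknowledged) or
 * -1 (at least one unacknowledged bad range), and reports the
 * lowest overlapping range through first_bad/bad_sectors.
 */
int range_check(const struct bad_range *tbl, int count,
		uint64_t s, int sectors,
		uint64_t *first_bad, int *bad_sectors)
{
	uint64_t target = s + sectors;	/* first sector after the query */
	int lo = 0, hi = count, rv = 0;

	/* Find the last range that starts before 'target'. */
	while (hi - lo > 1) {
		int mid = (lo + hi) / 2;

		if (tbl[mid].start < target)
			lo = mid;	/* could still be the one */
		else
			hi = mid;	/* starts too late */
	}

	if (hi > lo) {
		/* Walk backwards over every range that ends after 's'. */
		while (lo >= 0 && tbl[lo].start + tbl[lo].len > s) {
			if (tbl[lo].start < target) {
				if (rv != -1 && tbl[lo].acked)
					rv = 1;
				else
					rv = -1;
				*first_bad = tbl[lo].start;
				*bad_sectors = tbl[lo].len;
			}
			lo--;
		}
	}
	return rv;
}
```

Since the walk goes from higher to lower start sectors, the values left in
first_bad and bad_sectors describe the lowest overlapping range.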

Signed-off-by: Vishal Verma 
---
 block/Makefile|   2 +-
 block/badblocks.c | 523 ++
 include/linux/badblocks.h |  53 +
 3 files changed, 577 insertions(+), 1 deletion(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

diff --git a/block/Makefile b/block/Makefile
index 00ecc97..db5f622 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o 
blk-sysfs.o \
blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
-   partitions/
+   badblocks.o partitions/
 
 obj-$(CONFIG_BOUNCE)   += bounce.o
 obj-$(CONFIG_BLK_DEV_BSG)  += bsg.o
diff --git a/block/badblocks.c b/block/badblocks.c
new file mode 100644
index 0000000..6e07855
--- /dev/null
+++ b/block/badblocks.c
@@ -0,0 +1,523 @@
+/*
+ * Bad block management
+ *
+ * - Heavily based on MD badblocks code from Neil Brown
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ *  0 if there are no known bad blocks in the range
+ *  1 if there are known bad blocks which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+   sector_t *first_bad, int *bad_sectors)
+{
+   int hi;
+   int lo;
+   u64 *p = bb->page;
+   int rv;
+   sector_t target = s + sectors;
+   unsigned seq;
+
+   if (bb->shift > 0) {
+   /* round the start down, and the end up */
+   s >>= bb->shift;
+   target += (1<<bb->shift) - 1;
+   target >>= bb->shift;
+   sectors = target - s;
+   }
+   /* 'target' is now the first block after the bad range */
+
+retry:
+   seq = read_seqbegin(&bb->lock);
+   lo = 0;
+   rv = 0;
+   hi = bb->count;
+
+   /* Binary search between lo and hi for 'target'
+* i.e. for the last range that starts before 'target'
+*/
+   /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+* are known not to be the last range before target.
+* VARIANT: hi-lo is the number of possible
+* ranges, and decreases until it reaches 1
+*/
+   while (hi - lo > 1) {
+   int mid = (lo + hi) / 2;
+   sector_t a = BB_OFFSET(p[mid]);
+
+   if (a < target)
+   /* This could still be the one, earlier ranges
+* could not.
+ 

[PATCH v2 2/3] block: Add badblock management for gendisks

2015-12-07 Thread Vishal Verma
NVDIMM devices, which can behave more like DRAM rather than block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.

The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.

Use the new badblock management interfaces to add a badblocks list to
gendisks.

Signed-off-by: Vishal Verma 
---
 block/genhd.c | 76 +++
 include/linux/genhd.h |  7 +
 2 files changed, 83 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..809e3e2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "blk.h"
 
@@ -505,6 +506,16 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
+int disk_alloc_badblocks(struct gendisk *disk)
+{
+   disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL);
+   if (!disk->bb)
+   return -ENOMEM;
+
+   return badblocks_init(disk->bb, 1);
+}
+EXPORT_SYMBOL(disk_alloc_badblocks);
+
 static void register_disk(struct gendisk *disk)
 {
struct device *ddev = disk_to_dev(disk);
@@ -657,6 +668,11 @@ void del_gendisk(struct gendisk *disk)
blk_unregister_queue(disk);
blk_unregister_region(disk_devt(disk), disk->minors);
 
+   if (disk->bb) {
+   badblocks_free(disk->bb);
+   kfree(disk->bb);
+   }
+
part_stat_set_all(&disk->part0, 0);
disk->part0.stamp = 0;
 
@@ -670,6 +686,63 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+/*
+ * The gendisk usage of badblocks does not track acknowledgements for
+ * badblocks. We always assume they are acknowledged.
+ */
+int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+  sector_t *first_bad, int *bad_sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
+}
+EXPORT_SYMBOL(disk_check_badblocks);
+
+int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_set(disk->bb, s, sectors, 1);
+}
+EXPORT_SYMBOL(disk_set_badblocks);
+
+int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_clear(disk->bb, s, sectors);
+}
+EXPORT_SYMBOL(disk_clear_badblocks);
+
+/* sysfs access to bad-blocks list. */
+static ssize_t disk_badblocks_show(struct device *dev,
+   struct device_attribute *attr,
+   char *page)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_show(disk->bb, page, 0);
+}
+
+static ssize_t disk_badblocks_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *page, size_t len)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_store(disk->bb, page, len, 0);
+}
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
@@ -988,6 +1061,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, 
disk_discard_alignment_show,
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
+   disk_badblocks_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1009,6 +1084,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_capability.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+   &dev_attr_badblocks.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
 #endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 2adbfa6..985eb94 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -162,6 +162,7 @@ struct disk_part_tbl {
 };
 
 struct disk_events;
+struct badblocks;
 
 struct gendisk {
/* major, first_minor and minors are input parameters only,
@@ -201,6 +202,7 @@ struct gendisk {
struct blk_integrity *integrity;
 #endif
int node_id;
+   s

[PATCH v2 1/3] badblocks: Add core badblock management code

2015-12-07 Thread Vishal Verma
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.

Signed-off-by: Vishal Verma 
---
 block/Makefile|   2 +-
 block/badblocks.c | 576 ++
 include/linux/badblocks.h |  53 +
 3 files changed, 630 insertions(+), 1 deletion(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

diff --git a/block/Makefile b/block/Makefile
index 00ecc97..db5f622 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o 
blk-sysfs.o \
blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
-   partitions/
+   badblocks.o partitions/
 
 obj-$(CONFIG_BOUNCE)   += bounce.o
 obj-$(CONFIG_BLK_DEV_BSG)  += bsg.o
diff --git a/block/badblocks.c b/block/badblocks.c
new file mode 100644
index 000..f0ac279
--- /dev/null
+++ b/block/badblocks.c
@@ -0,0 +1,576 @@
+/*
+ * Bad block management
+ *
+ * - Heavily based on MD badblocks code from Neil Brown
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * badblocks_check() - check a given range for bad sectors
+ * @bb: the badblocks structure that holds all badblock information
+ * @s: sector (start) at which to check for badblocks
+ * @sectors:   number of sectors to check for badblocks
+ * @first_bad: pointer to store location of the first badblock
+ * @bad_sectors: pointer to store number of badblocks after @first_bad
+ *
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ *
+ * Return:
+ *  0: there are no known bad blocks in the range
+ *  1: there are known bad blocks which are all acknowledged
+ * -1: there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+   sector_t *first_bad, int *bad_sectors)
+{
+   int hi;
+   int lo;
+   u64 *p = bb->page;
+   int rv;
+   sector_t target = s + sectors;
+   unsigned seq;
+
+   if (bb->shift > 0) {
+   /* round the start down, and the end up */
+   s >>= bb->shift;
+   target += (1<<bb->shift) - 1;
+   target >>= bb->shift;
+   sectors = target - s;
+   }
+   /* 'target' is now the first block after the bad range */
+
+retry:
+   seq = read_seqbegin(&bb->lock);
+   lo = 0;
+   rv = 0;
+   hi = bb->count;
+
+   /* Binary search between lo and hi for 'target'
+* i.e. for the last range that starts before 'target'
+*/
+   /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+* are known not to be the last range before target.
+   

[PATCH v2 3/3] md: convert to use the generic badblocks code

2015-12-07 Thread Vishal Verma
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks.h for all manipulation.

Signed-off-by: Vishal Verma 
---
 drivers/md/md.c | 516 +++-
 drivers/md/md.h |  40 +
 2 files changed, 28 insertions(+), 528 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c702de1..afdc3ea 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -34,6 +34,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -707,8 +708,7 @@ void md_rdev_clear(struct md_rdev *rdev)
put_page(rdev->bb_page);
rdev->bb_page = NULL;
}
-   kfree(rdev->badblocks.page);
-   rdev->badblocks.page = NULL;
+   badblocks_free(&rdev->badblocks);
 }
 EXPORT_SYMBOL_GPL(md_rdev_clear);
 
@@ -1358,8 +1358,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
return cpu_to_le32(csum);
 }
 
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-   int acknowledged);
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int 
minor_version)
 {
struct mdp_superblock_1 *sb;
@@ -1484,8 +1482,7 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
count <<= sb->bblog_shift;
if (bb + 1 == 0)
break;
-   if (md_set_badblocks(&rdev->badblocks,
-sector, count, 1) == 0)
+   if (badblocks_set(&rdev->badblocks, sector, count, 1))
return -EINVAL;
}
} else if (sb->bblog_offset != 0)
@@ -2226,7 +2223,7 @@ repeat:
rdev_for_each(rdev, mddev) {
if (rdev->badblocks.changed) {
rdev->badblocks.changed = 0;
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
md_error(mddev, rdev);
}
clear_bit(Blocked, &rdev->flags);
@@ -2352,7 +2349,7 @@ repeat:
clear_bit(Blocked, &rdev->flags);
 
if (any_badblocks_changed)
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
clear_bit(BlockedBadBlocks, &rdev->flags);
wake_up(&rdev->blocked_wait);
}
@@ -2944,11 +2941,17 @@ static ssize_t recovery_start_store(struct md_rdev 
*rdev, const char *buf, size_
 static struct rdev_sysfs_entry rdev_recovery_start =
 __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, 
recovery_start_store);
 
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack);
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
-
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ *are recorded as bad.  The list is truncated to fit within
+ *the one-page limit of sysfs.
+ *Writing "sector length" to this file adds an acknowledged
+ *bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ *been acknowledged.  Writing to this file adds bad blocks
+ *without acknowledging them.  This is largely for testing.
+ */
 static ssize_t bb_show(struct md_rdev *rdev, char *page)
 {
return badblocks_show(&rdev->badblocks, page, 0);
@@ -3063,14 +3066,7 @@ int md_rdev_init(struct md_rdev *rdev)
 * This reserves the space even on arrays where it cannot
 * be used - I wonder if that matters
 */
-   rdev->badblocks.count = 0;
-   rdev->badblocks.shift = -1; /* disabled until explicitly enabled */
-   rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   seqlock_init(&rdev->badblocks.lock);
-   if (rdev->badblocks.page == NULL)
-   return -ENOMEM;
-
-   return 0;
+   return badblocks_init(&rdev->badblocks, 0);
 }
 EXPORT_SYMBOL_GPL(md_rdev_init);
 /*
@@ -8348,254 +8344,9 @@ void md_finish_reshape(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_finish_reshape);
 
-/* Bad block management.
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- *  A 'shift' can be 

[PATCH v2 0/3] Badblock tracking for gendisks

2015-12-07 Thread Vishal Verma
v3:
  - Add kernel-doc style comments to all exported functions in badblocks.c
    (James)
  - Make return values from badblocks functions consistent with themselves
    and the kernel style. Change the polarity of badblocks_set, and update
    all callers accordingly (James)
  - In gendisk, don't unconditionally allocate badblocks, export the
    initializer. This also allows the initializer to be a non-void return
    type, so that the badblocks user can act upon failures better (James)


v2:
  - In badblocks_free, make 'page' NULL (patch 1)
  - Move the core badblocks code to a new .c file (patch 1) (Jens)
  - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
  - Since disk_alloc_badblocks can fail, check disk->bb for NULL in the
genhd wrappers (patch 2) (Jeff)
  - Update the md conversion to also use the badblocks init and free
functions (patch 3)
  - Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies the badblock management code into files of its own
(block/badblocks.c and include/linux/badblocks.h),
making it generally available. It follows common libraries of code
such as linked lists, where anyone may embed a core data structure
in another place, and use the provided accessor functions to
manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use
by NVDIMM devices).

Patch 3 converts md over to use the new badblocks 'library'. I have
done some pretty simple testing on this - created a raid 1 device,
made sure the sysfs entries show up, and can be used to add and view
badblocks. A closer look by the md folks would be nice here.

Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile|   2 +-
 block/badblocks.c | 576 ++
 block/genhd.c |  76 ++
 drivers/md/md.c   | 516 ++---
 drivers/md/md.h   |  40 +---
 include/linux/badblocks.h |  53 +
 include/linux/genhd.h |   7 +
 7 files changed, 741 insertions(+), 529 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 0/3] Badblock tracking for gendisks

2015-12-08 Thread Vishal Verma
v4:
  - Rebase to v4.4-rc4

v3:
  - Add kernel-doc style comments to all exported functions in badblocks.c
    (James)
  - Make return values from badblocks functions consistent with themselves
    and the kernel style. Change the polarity of badblocks_set, and update
    all callers accordingly (James)
  - In gendisk, don't unconditionally allocate badblocks, export the
    initializer. This also allows the initializer to be a non-void return
    type, so that the badblocks user can act upon failures better (James)


v2:
  - In badblocks_free, make 'page' NULL (patch 1)
  - Move the core badblocks code to a new .c file (patch 1) (Jens)
  - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
  - Since disk_alloc_badblocks can fail, check disk->bb for NULL in the
genhd wrappers (patch 2) (Jeff)
  - Update the md conversion to also use the badblocks init and free
functions (patch 3)
  - Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies the badblock management code into files of its own
(block/badblocks.c and include/linux/badblocks.h),
making it generally available. It follows common libraries of code
such as linked lists, where anyone may embed a core data structure
in another place, and use the provided accessor functions to
manipulate the data.

Patch 2 adds badblock tracking to gendisks (in preparation for use
by NVDIMM devices).

Patch 3 converts md over to use the new badblocks 'library'. I have
done some pretty simple testing on this - created a raid 1 device,
made sure the sysfs entries show up, and can be used to add and view
badblocks. A closer look by the md folks would be nice here.


Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile|   2 +-
 block/badblocks.c | 576 ++
 block/genhd.c |  76 ++
 drivers/md/md.c   | 516 ++---
 drivers/md/md.h   |  40 +---
 include/linux/badblocks.h |  53 +
 include/linux/genhd.h |   7 +
 7 files changed, 741 insertions(+), 529 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0



[PATCH v4 3/3] md: convert to use the generic badblocks code

2015-12-08 Thread Vishal Verma
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks.h for all manipulation.

Signed-off-by: Vishal Verma 
---
 drivers/md/md.c | 516 +++-
 drivers/md/md.h |  40 +
 2 files changed, 28 insertions(+), 528 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 807095f..1e48aa9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -34,6 +34,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -709,8 +710,7 @@ void md_rdev_clear(struct md_rdev *rdev)
put_page(rdev->bb_page);
rdev->bb_page = NULL;
}
-   kfree(rdev->badblocks.page);
-   rdev->badblocks.page = NULL;
+   badblocks_free(&rdev->badblocks);
 }
 EXPORT_SYMBOL_GPL(md_rdev_clear);
 
@@ -1360,8 +1360,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
return cpu_to_le32(csum);
 }
 
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-   int acknowledged);
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int 
minor_version)
 {
struct mdp_superblock_1 *sb;
@@ -1486,8 +1484,7 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
count <<= sb->bblog_shift;
if (bb + 1 == 0)
break;
-   if (md_set_badblocks(&rdev->badblocks,
-sector, count, 1) == 0)
+   if (badblocks_set(&rdev->badblocks, sector, count, 1))
return -EINVAL;
}
} else if (sb->bblog_offset != 0)
@@ -2319,7 +2316,7 @@ repeat:
rdev_for_each(rdev, mddev) {
if (rdev->badblocks.changed) {
rdev->badblocks.changed = 0;
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
md_error(mddev, rdev);
}
clear_bit(Blocked, &rdev->flags);
@@ -2445,7 +2442,7 @@ repeat:
clear_bit(Blocked, &rdev->flags);
 
if (any_badblocks_changed)
-   md_ack_all_badblocks(&rdev->badblocks);
+   ack_all_badblocks(&rdev->badblocks);
clear_bit(BlockedBadBlocks, &rdev->flags);
wake_up(&rdev->blocked_wait);
}
@@ -3046,11 +3043,17 @@ static ssize_t recovery_start_store(struct md_rdev 
*rdev, const char *buf, size_
 static struct rdev_sysfs_entry rdev_recovery_start =
 __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, 
recovery_start_store);
 
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack);
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
-
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ *are recorded as bad.  The list is truncated to fit within
+ *the one-page limit of sysfs.
+ *Writing "sector length" to this file adds an acknowledged
+ *bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ *been acknowledged.  Writing to this file adds bad blocks
+ *without acknowledging them.  This is largely for testing.
+ */
 static ssize_t bb_show(struct md_rdev *rdev, char *page)
 {
return badblocks_show(&rdev->badblocks, page, 0);
@@ -3165,14 +3168,7 @@ int md_rdev_init(struct md_rdev *rdev)
 * This reserves the space even on arrays where it cannot
 * be used - I wonder if that matters
 */
-   rdev->badblocks.count = 0;
-   rdev->badblocks.shift = -1; /* disabled until explicitly enabled */
-   rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   seqlock_init(&rdev->badblocks.lock);
-   if (rdev->badblocks.page == NULL)
-   return -ENOMEM;
-
-   return 0;
+   return badblocks_init(&rdev->badblocks, 0);
 }
 EXPORT_SYMBOL_GPL(md_rdev_init);
 /*
@@ -8478,254 +8474,9 @@ void md_finish_reshape(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_finish_reshape);
 
-/* Bad block management.
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- *  A 'shift' can be 

[PATCH v4 2/3] block: Add badblock management for gendisks

2015-12-08 Thread Vishal Verma
NVDIMM devices, which can behave more like DRAM rather than block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.

The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.

Use the new badblock management interfaces to add a badblocks list to
gendisks.

Signed-off-by: Vishal Verma 
---
 block/genhd.c | 76 +++
 include/linux/genhd.h |  7 +
 2 files changed, 83 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa5..81dcf32 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "blk.h"
 
@@ -505,6 +506,16 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
+int disk_alloc_badblocks(struct gendisk *disk)
+{
+   disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL);
+   if (!disk->bb)
+   return -ENOMEM;
+
+   return badblocks_init(disk->bb, 1);
+}
+EXPORT_SYMBOL(disk_alloc_badblocks);
+
 static void register_disk(struct gendisk *disk)
 {
struct device *ddev = disk_to_dev(disk);
@@ -659,6 +670,11 @@ void del_gendisk(struct gendisk *disk)
blk_unregister_queue(disk);
blk_unregister_region(disk_devt(disk), disk->minors);
 
+   if (disk->bb) {
+   badblocks_free(disk->bb);
+   kfree(disk->bb);
+   }
+
part_stat_set_all(&disk->part0, 0);
disk->part0.stamp = 0;
 
@@ -672,6 +688,63 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+/*
+ * The gendisk usage of badblocks does not track acknowledgements for
+ * badblocks. We always assume they are acknowledged.
+ */
+int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+  sector_t *first_bad, int *bad_sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
+}
+EXPORT_SYMBOL(disk_check_badblocks);
+
+int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_set(disk->bb, s, sectors, 1);
+}
+EXPORT_SYMBOL(disk_set_badblocks);
+
+int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_clear(disk->bb, s, sectors);
+}
+EXPORT_SYMBOL(disk_clear_badblocks);
+
+/* sysfs access to bad-blocks list. */
+static ssize_t disk_badblocks_show(struct device *dev,
+   struct device_attribute *attr,
+   char *page)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_show(disk->bb, page, 0);
+}
+
+static ssize_t disk_badblocks_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *page, size_t len)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+   return 0;
+
+   return badblocks_store(disk->bb, page, len, 0);
+}
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
@@ -990,6 +1063,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, 
disk_discard_alignment_show,
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
+   disk_badblocks_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1011,6 +1086,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_capability.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+   &dev_attr_badblocks.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
 #endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d..0bbec68 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -162,6 +162,7 @@ struct disk_part_tbl {
 };
 
 struct disk_events;
+struct badblocks;
 
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 
@@ -213,6 +214,7 @@ struct gendisk {
struct kobject integrity_kobj;
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
int node_id;
+   struct badblocks *b

[PATCH v4 1/3] badblocks: Add core badblock management code

2015-12-08 Thread Vishal Verma
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.

Signed-off-by: Vishal Verma 
---
 block/Makefile|   2 +-
 block/badblocks.c | 576 ++
 include/linux/badblocks.h |  53 +
 3 files changed, 630 insertions(+), 1 deletion(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

diff --git a/block/Makefile b/block/Makefile
index 00ecc97..db5f622 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o 
blk-sysfs.o \
blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
-   partitions/
+   badblocks.o partitions/
 
 obj-$(CONFIG_BOUNCE)   += bounce.o
 obj-$(CONFIG_BLK_DEV_BSG)  += bsg.o
diff --git a/block/badblocks.c b/block/badblocks.c
new file mode 100644
index 000..f0ac279
--- /dev/null
+++ b/block/badblocks.c
@@ -0,0 +1,576 @@
+/*
+ * Bad block management
+ *
+ * - Heavily based on MD badblocks code from Neil Brown
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * badblocks_check() - check a given range for bad sectors
+ * @bb: the badblocks structure that holds all badblock information
+ * @s: sector (start) at which to check for badblocks
+ * @sectors:   number of sectors to check for badblocks
+ * @first_bad: pointer to store location of the first badblock
+ * @bad_sectors: pointer to store number of badblocks after @first_bad
+ *
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ *
+ * Return:
+ *  0: there are no known bad blocks in the range
+ *  1: there are known bad blocks which are all acknowledged
+ * -1: there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+   sector_t *first_bad, int *bad_sectors)
+{
+   int hi;
+   int lo;
+   u64 *p = bb->page;
+   int rv;
+   sector_t target = s + sectors;
+   unsigned seq;
+
+   if (bb->shift > 0) {
+       /* round the start down, and the end up */
+       s >>= bb->shift;
+       target += (1<<bb->shift) - 1;
+       target >>= bb->shift;
+       sectors = target - s;
+   }
+   /* 'target' is now the first block after the bad range */
+
+retry:
+   seq = read_seqbegin(&bb->lock);
+   lo = 0;
+   rv = 0;
+   hi = bb->count;
+
+   /* Binary search between lo and hi for 'target'
+    * i.e. for the last range that starts before 'target'
+    */
+   /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+    * are known not to be the last range before target.
+   

[PATCH v5 2/3] block: Add badblock management for gendisks

2015-12-24 Thread Vishal Verma
NVDIMM devices, which can behave more like DRAM rather than block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.

The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.

Use the new badblock management interfaces to add a badblocks list to
gendisks.

Signed-off-by: Vishal Verma 
---
 block/genhd.c | 76 +++
 include/linux/genhd.h |  7 +
 2 files changed, 83 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa5..81dcf32 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/badblocks.h>
 
 #include "blk.h"
 
@@ -505,6 +506,16 @@ static int exact_lock(dev_t devt, void *data)
    return 0;
 }
 
+int disk_alloc_badblocks(struct gendisk *disk)
+{
+   disk->bb = kzalloc(sizeof(*(disk->bb)), GFP_KERNEL);
+   if (!disk->bb)
+       return -ENOMEM;
+
+   return badblocks_init(disk->bb, 1);
+}
+EXPORT_SYMBOL(disk_alloc_badblocks);
+
 static void register_disk(struct gendisk *disk)
 {
    struct device *ddev = disk_to_dev(disk);
@@ -659,6 +670,11 @@ void del_gendisk(struct gendisk *disk)
    blk_unregister_queue(disk);
    blk_unregister_region(disk_devt(disk), disk->minors);
 
+   if (disk->bb) {
+       badblocks_free(disk->bb);
+       kfree(disk->bb);
+   }
+
    part_stat_set_all(&disk->part0, 0);
    disk->part0.stamp = 0;
 
@@ -672,6 +688,63 @@ void del_gendisk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(del_gendisk);
 
+/*
+ * The gendisk usage of badblocks does not track acknowledgements for
+ * badblocks. We always assume they are acknowledged.
+ */
+int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
+                         sector_t *first_bad, int *bad_sectors)
+{
+   if (!disk->bb)
+       return 0;
+
+   return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
+}
+EXPORT_SYMBOL(disk_check_badblocks);
+
+int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+       return 0;
+
+   return badblocks_set(disk->bb, s, sectors, 1);
+}
+EXPORT_SYMBOL(disk_set_badblocks);
+
+int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
+{
+   if (!disk->bb)
+       return 0;
+
+   return badblocks_clear(disk->bb, s, sectors);
+}
+EXPORT_SYMBOL(disk_clear_badblocks);
+
+/* sysfs access to bad-blocks list. */
+static ssize_t disk_badblocks_show(struct device *dev,
+                                   struct device_attribute *attr,
+                                   char *page)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+       return 0;
+
+   return badblocks_show(disk->bb, page, 0);
+}
+
+static ssize_t disk_badblocks_store(struct device *dev,
+                                    struct device_attribute *attr,
+                                    const char *page, size_t len)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (!disk->bb)
+       return 0;
+
+   return badblocks_store(disk->bb, page, len, 0);
+}
+
 /**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
@@ -990,6 +1063,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show,
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
+   disk_badblocks_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
    __ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1011,6 +1086,7 @@ static struct attribute *disk_attrs[] = {
    &dev_attr_capability.attr,
    &dev_attr_stat.attr,
    &dev_attr_inflight.attr,
+   &dev_attr_badblocks.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
    &dev_attr_fail.attr,
 #endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d..0bbec68 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -162,6 +162,7 @@ struct disk_part_tbl {
 };
 
 struct disk_events;
+struct badblocks;
 
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 
@@ -213,6 +214,7 @@ struct gendisk {
    struct kobject integrity_kobj;
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
    int node_id;
+   struct badblocks *b

[PATCH v5 3/3] md: convert to use the generic badblocks code

2015-12-24 Thread Vishal Verma
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks.h for all manipulation.

Signed-off-by: Vishal Verma 
---
 drivers/md/md.c | 516 +++-
 drivers/md/md.h |  40 +
 2 files changed, 28 insertions(+), 528 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index dbedc58..51dc9f3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -34,6 +34,7 @@
 
 #include 
 #include 
+#include <linux/badblocks.h>
 #include 
 #include 
 #include 
@@ -710,8 +711,7 @@ void md_rdev_clear(struct md_rdev *rdev)
        put_page(rdev->bb_page);
        rdev->bb_page = NULL;
    }
-   kfree(rdev->badblocks.page);
-   rdev->badblocks.page = NULL;
+   badblocks_free(&rdev->badblocks);
 }
 EXPORT_SYMBOL_GPL(md_rdev_clear);
 
@@ -1361,8 +1361,6 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
return cpu_to_le32(csum);
 }
 
-static int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
-   int acknowledged);
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_version)
 {
    struct mdp_superblock_1 *sb;
@@ -1487,8 +1485,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
            count <<= sb->bblog_shift;
            if (bb + 1 == 0)
                break;
-           if (md_set_badblocks(&rdev->badblocks,
-                                sector, count, 1) == 0)
+           if (badblocks_set(&rdev->badblocks, sector, count, 1))
                return -EINVAL;
        }
    } else if (sb->bblog_offset != 0)
@@ -2320,7 +2317,7 @@ repeat:
        rdev_for_each(rdev, mddev) {
            if (rdev->badblocks.changed) {
                rdev->badblocks.changed = 0;
-               md_ack_all_badblocks(&rdev->badblocks);
+               ack_all_badblocks(&rdev->badblocks);
                md_error(mddev, rdev);
            }
            clear_bit(Blocked, &rdev->flags);
@@ -2446,7 +2443,7 @@ repeat:
            clear_bit(Blocked, &rdev->flags);
 
        if (any_badblocks_changed)
-           md_ack_all_badblocks(&rdev->badblocks);
+           ack_all_badblocks(&rdev->badblocks);
        clear_bit(BlockedBadBlocks, &rdev->flags);
        wake_up(&rdev->blocked_wait);
    }
@@ -3054,11 +3051,17 @@ static ssize_t recovery_start_store(struct md_rdev *rdev, const char *buf, size_
 static struct rdev_sysfs_entry rdev_recovery_start =
 __ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
 
-static ssize_t
-badblocks_show(struct badblocks *bb, char *page, int unack);
-static ssize_t
-badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
-
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ *    are recorded as bad.  The list is truncated to fit within
+ *    the one-page limit of sysfs.
+ *    Writing "sector length" to this file adds an acknowledged
+ *    bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ *    been acknowledged.  Writing to this file adds bad blocks
+ *    without acknowledging them.  This is largely for testing.
+ */
 static ssize_t bb_show(struct md_rdev *rdev, char *page)
 {
    return badblocks_show(&rdev->badblocks, page, 0);
@@ -3173,14 +3176,7 @@ int md_rdev_init(struct md_rdev *rdev)
     * This reserves the space even on arrays where it cannot
     * be used - I wonder if that matters
     */
-   rdev->badblocks.count = 0;
-   rdev->badblocks.shift = -1; /* disabled until explicitly enabled */
-   rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   seqlock_init(&rdev->badblocks.lock);
-   if (rdev->badblocks.page == NULL)
-       return -ENOMEM;
-
-   return 0;
+   return badblocks_init(&rdev->badblocks, 0);
 }
 EXPORT_SYMBOL_GPL(md_rdev_init);
 /*
@@ -8486,254 +8482,9 @@ void md_finish_reshape(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_finish_reshape);
 
-/* Bad block management.
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- *  A 'shift' can be 

[PATCH v5 1/3] badblocks: Add core badblock management code

2015-12-24 Thread Vishal Verma
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.

Signed-off-by: Vishal Verma 
---
 block/Makefile|   2 +-
 block/badblocks.c | 561 ++
 include/linux/badblocks.h |  53 +
 3 files changed, 615 insertions(+), 1 deletion(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

diff --git a/block/Makefile b/block/Makefile
index 00ecc97..db5f622 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
-   partitions/
+   badblocks.o partitions/
 
 obj-$(CONFIG_BOUNCE)   += bounce.o
 obj-$(CONFIG_BLK_DEV_BSG)  += bsg.o
diff --git a/block/badblocks.c b/block/badblocks.c
new file mode 100644
index 0000000..96aeb91
--- /dev/null
+++ b/block/badblocks.c
@@ -0,0 +1,561 @@
+/*
+ * Bad block management
+ *
+ * - Heavily based on MD badblocks code from Neil Brown
+ *
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * badblocks_check() - check a given range for bad sectors
+ * @bb:        the badblocks structure that holds all badblock information
+ * @s: sector (start) at which to check for badblocks
+ * @sectors:   number of sectors to check for badblocks
+ * @first_bad: pointer to store location of the first badblock
+ * @bad_sectors: pointer to store number of badblocks after @first_bad
+ *
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ *
+ * Return:
+ *  0: there are no known bad blocks in the range
+ *  1: there are known bad blocks which are all acknowledged
+ * -1: there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+   sector_t *first_bad, int *bad_sectors)
+{
+   int hi;
+   int lo;
+   u64 *p = bb->page;
+   int rv;
+   sector_t target = s + sectors;
+   unsigned seq;
+
+   if (bb->shift > 0) {
+       /* round the start down, and the end up */
+       s >>= bb->shift;
+       target += (1<<bb->shift) - 1;
+       target >>= bb->shift;
+       sectors = target - s;
+   }
+   /* 'target' is now the first block after the bad range */
+
+retry:
+   seq = read_seqbegin(&bb->lock);
+   lo = 0;
+   rv = 0;
+   hi = bb->count;
+
+   /* Binary search between lo and hi for 'target'
+    * i.e. for the last range that starts before 'target'
+    */
+   /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+    * are known not to be the last range before target.
+   
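
The lookup that the badblocks_check() comment describes (binary-search to the last range starting before the end of the query, then walk backwards over every range that still ends after the start) can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: struct bb_range and bb_check() are hypothetical names, the table uses an unpacked struct rather than the packed 64-bit entries, and the seqlock retry and the 'shift' rounding are left out.

```c
#include <stdint.h>

/* Simplified stand-in for one table entry (the series packs these into a u64). */
struct bb_range {
    uint64_t start;   /* first bad sector */
    uint32_t len;     /* number of bad sectors, >= 1 */
    int      acked;   /* nonzero if acknowledged */
};

/*
 * Check [s, s + sectors) against 'tbl', which must be sorted by start and
 * non-overlapping. Returns 0 (clean), 1 (bad, all acknowledged) or
 * -1 (at least one unacknowledged bad range). On a nonzero return,
 * *first_bad and *bad_sectors describe the lowest overlapping range.
 */
static int bb_check(const struct bb_range *tbl, int count,
                    uint64_t s, uint32_t sectors,
                    uint64_t *first_bad, uint32_t *bad_sectors)
{
    uint64_t target = s + sectors;  /* first sector after the query */
    int lo = 0, hi = count, rv = 0;

    /* Narrow [lo, hi) down to the last range starting before 'target'. */
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;

        if (tbl[mid].start < target)
            lo = mid;
        else
            hi = mid;
    }

    /* Walk backwards over every range that ends after 's'. */
    while (hi > lo && lo >= 0 && tbl[lo].start + tbl[lo].len > s) {
        if (tbl[lo].start < target) {
            /* Starts before the end and finishes after the start: overlap. */
            if (rv != -1 && tbl[lo].acked)
                rv = 1;
            else
                rv = -1;
            *first_bad = tbl[lo].start;
            *bad_sectors = tbl[lo].len;
        }
        lo--;
    }
    return rv;
}
```

Because the walk runs from high to low, the reported range ends up being the lowest overlapping one, and a single unacknowledged range anywhere in the query is enough to make the result -1.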

[PATCH v5 0/3] Badblock tracking for gendisks

2015-12-24 Thread Vishal Verma
v5:
  - Rebase to v4.4-rc6
  - Revert to using kzalloc instead of __get_free_page, based on the
    discussion at: http://thread.gmane.org/gmane.linux.kernel/2113292

v4:
  - Rebase to v4.4-rc4

v3:
  - Add kernel-doc style comments to all exported functions in
    badblocks.c (James)
  - Make return values from badblocks functions consistent with themselves
    and the kernel style. Change the polarity of badblocks_set, and update
    all callers accordingly (James)
  - In gendisk, don't unconditionally allocate badblocks, export the
    initializer. This also allows the initializer to be a non-void return
    type, so that the badblocks user can act upon failures better (James)
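
To illustrate the polarity change: judging by the md hunk in patch 3, the old md_set_badblocks() reported failure as 0, while badblocks_set() reports success as 0 and failure as nonzero. The sketch below uses hypothetical stand-in functions (none of these names exist in the series) purely to show how a caller's test flips.

```c
#include <errno.h>

/* Hypothetical stand-in, old md convention: 1 = success, 0 = failure. */
static int old_set_badblocks(int table_full)
{
    return table_full ? 0 : 1;
}

/* Hypothetical stand-in, new convention: 0 = success, nonzero = failure. */
static int new_badblocks_set(int table_full)
{
    return table_full ? 1 : 0;
}

/* Caller written against the old convention. */
static int load_old(int table_full)
{
    if (old_set_badblocks(table_full) == 0)
        return -EINVAL;
    return 0;
}

/* The same caller after conversion: any nonzero result is an error. */
static int load_new(int table_full)
{
    if (new_badblocks_set(table_full))
        return -EINVAL;
    return 0;
}
```

Both callers behave identically from the outside; only the test on the return value changes.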


v2:
  - In badblocks_free, make 'page' NULL (patch 1)
  - Move the core badblocks code to a new .c file (patch 1) (Jens)
  - Fix a sizeof usage in disk_alloc_badblocks (patch 2) (Dan)
  - Since disk_alloc_badblocks can fail, check disk->bb for NULL in the
    genhd wrappers (patch 2) (Jeff)
  - Update the md conversion to also use the badblocks init and free
    functions (patch 3)
  - Remove the BB_* macros from md.h as they are now in badblocks.h (patch 3)

Patch 1 copies badblock management code into a source file and header
of its own, making it generally available. It follows common libraries
of code such as linked lists, where anyone may embed a core data
structure in another place, and use the provided accessor functions to
manipulate the data.
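
For reference, the data those accessors manipulate is a table of single 64-bit words, laid out as the badblocks_check() comment in patch 1 describes: a 9-bit length, a 54-bit start sector, and an 'acknowledged' flag in the most significant bit. A minimal userspace sketch of that packing follows. The BB_* names mirror the macros used in the series, but bb_make() and the stdint types are illustrative assumptions, not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks for one 64-bit bad-block table entry. */
#define BB_LEN_MASK     0x00000000000001FFULL  /* bits 0-8: length - 1 */
#define BB_OFFSET_MASK  0x7FFFFFFFFFFFFE00ULL  /* bits 9-62: start sector */
#define BB_ACK_MASK     0x8000000000000000ULL  /* bit 63: acknowledged */
#define BB_MAX_LEN      512

#define BB_OFFSET(x)    (((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)       (((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)       (!!((x) & BB_ACK_MASK))

/* Hypothetical helper: pack a range into one entry. */
static uint64_t bb_make(uint64_t start, unsigned int len, int ack)
{
    assert(len >= 1 && len <= BB_MAX_LEN);  /* stored as len - 1 in 9 bits */
    assert(start < (1ULL << 54));           /* start must fit in 54 bits */
    return (start << 9) | (uint64_t)(len - 1) | ((uint64_t)!!ack << 63);
}
```

Storing length minus one is what lets 9 bits cover lengths 1-512, and a longer bad range has to be split across several entries.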

Patch 2 adds badblock tracking to gendisks (in preparation for use
by NVDIMM devices).
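
The gendisk side is opt-in: a disk has no list until its driver calls disk_alloc_badblocks(), and the wrappers treat a missing list as "nothing bad". A toy userspace model of that behaviour is sketched below. All toy_* names are hypothetical, the tracker remembers only a single range, and none of this is the genhd code; it only shows the NULL-check pattern.

```c
#include <errno.h>
#include <stdlib.h>

/* Toy tracker: remembers a single bad range (the real table holds many). */
struct toy_badblocks {
    unsigned long long start;
    int len;            /* 0 means "nothing recorded" */
};

/* Stand-in for struct gendisk; 'bb' stays NULL unless the driver opts in. */
struct toy_disk {
    struct toy_badblocks *bb;
};

/* Models disk_alloc_badblocks(): allocation is explicit and can fail. */
static int toy_disk_alloc_badblocks(struct toy_disk *disk)
{
    disk->bb = calloc(1, sizeof(*disk->bb));
    if (!disk->bb)
        return -ENOMEM;
    return 0;
}

/* Models disk_set_badblocks(): a disk without a list reports success. */
static int toy_disk_set_badblocks(struct toy_disk *disk,
                                  unsigned long long s, int sectors)
{
    if (!disk->bb)
        return 0;
    disk->bb->start = s;
    disk->bb->len = sectors;
    return 0;
}

/* Models disk_check_badblocks(): 0 = clean, 1 = overlaps a known bad range. */
static int toy_disk_check_badblocks(const struct toy_disk *disk,
                                    unsigned long long s, int sectors)
{
    const struct toy_badblocks *bb = disk->bb;

    if (!bb || bb->len == 0)
        return 0;
    return s < bb->start + bb->len && bb->start < s + sectors;
}

/* Models the del_gendisk() hunk: release the list if one was allocated. */
static void toy_disk_free_badblocks(struct toy_disk *disk)
{
    free(disk->bb);
    disk->bb = NULL;
}
```

Since the gendisk wrappers always record ranges as acknowledged, a hit is reported as 1 here and never as -1.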

Patch 3 converts md over to use the new badblocks 'library'. I have
done some pretty simple testing on this - created a raid 1 device,
made sure the sysfs entries show up, and can be used to add and view
badblocks. A closer look by the md folks would be nice here.
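
For anyone scripting a similar check, the sysfs text can be parsed with a few lines of C. The sketch below assumes the file contents are one "sector length" pair per line, as the md comment in patch 3 describes. parse_badblocks_text() is a hypothetical helper, and it works on a string buffer so that no particular sysfs path has to be assumed.

```c
#include <stdio.h>

/*
 * Parse text in the "<sector> <length>" one-pair-per-line form into the
 * caller's arrays. Returns the number of pairs parsed (at most 'max').
 */
static int parse_badblocks_text(const char *text,
                                unsigned long long *sector,
                                unsigned int *length, int max)
{
    int n = 0;
    int used;

    /* %n records how many characters were consumed so we can advance. */
    while (n < max &&
           sscanf(text, "%llu %u%n", &sector[n], &length[n], &used) == 2) {
        text += used;
        n++;
    }
    return n;
}
```

Parsing stops at the first line that does not match the expected form, so trailing garbage is ignored rather than reported.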


Vishal Verma (3):
  badblocks: Add core badblock management code
  block: Add badblock management for gendisks
  md: convert to use the generic badblocks code

 block/Makefile|   2 +-
 block/badblocks.c | 561 ++
 block/genhd.c |  76 +++
 drivers/md/md.c   | 516 +++---
 drivers/md/md.h   |  40 +---
 include/linux/badblocks.h |  53 +
 include/linux/genhd.h |   7 +
 7 files changed, 726 insertions(+), 529 deletions(-)
 create mode 100644 block/badblocks.c
 create mode 100644 include/linux/badblocks.h

-- 
2.5.0
