Re: Block device integrity support

2017-06-03 Thread Christoph Hellwig
On Fri, Jun 02, 2017 at 03:52:57PM -0700, Jens Axboe wrote:
> On 06/02/2017 03:38 PM, Bart Van Assche wrote:
> > On Fri, 2017-06-02 at 18:24 -0400, Martin K. Petersen wrote:
> >>> then the output shown below appears in the kernel log. Does anyone know 
> >>> how
> >>> to fix this? Sorry but I'm not really familiar with the integrity
> >>> code.
> >>
> >> Dmitry posted a fix for this a few weeks ago.
> > 
> > Ah, that's right. Jens, had you noticed this message:
> > https://www.spinics.net/lists/fstests/msg06214.html?
> 
> No, unfortunately not. I'll queue it up for 4.13 and mark it
> stable for 4.12.

Can we please get it into 4.12-rc?


[PATCH 01/13] nvme-lightnvm: use blk_execute_rq in nvme_nvm_submit_user_cmd

2017-06-03 Thread Christoph Hellwig
Instead of reinventing it poorly.
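
For context, blk_execute_rq() already wraps the submit-and-wait pattern the
driver had open-coded; a simplified sketch of that pattern (not the exact
block layer source, which lives in block/blk-exec.c) looks like this:

/* Simplified sketch of the synchronous execution helper being reused. */
static void sync_end_io(struct request *rq, int error)
{
        complete(rq->end_io_data);
}

static void execute_rq_and_wait(struct request_queue *q, struct request *rq)
{
        DECLARE_COMPLETION_ONSTACK(wait);

        rq->end_io_data = &wait;
        blk_execute_rq_nowait(q, NULL, rq, 0, sync_end_io);
        wait_for_completion_io(&wait);
}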

Signed-off-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
Reviewed-by: Javier González 
---
 drivers/nvme/host/lightnvm.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index f5df78ed1e10..f3885b5e56bd 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -571,13 +571,6 @@ static struct nvm_dev_ops nvme_nvm_dev_ops = {
.max_phys_sect  = 64,
 };
 
-static void nvme_nvm_end_user_vio(struct request *rq, int error)
-{
-   struct completion *waiting = rq->end_io_data;
-
-   complete(waiting);
-}
-
 static int nvme_nvm_submit_user_cmd(struct request_queue *q,
struct nvme_ns *ns,
struct nvme_nvm_command *vcmd,
@@ -608,7 +601,6 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
rq->timeout = timeout ? timeout : ADMIN_TIMEOUT;
 
rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;
-   rq->end_io_data = &wait;
 
if (ppa_buf && ppa_len) {
ppa_list = dma_pool_alloc(dev->dma_pool, GFP_KERNEL, &ppa_dma);
@@ -662,9 +654,7 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
}
 
 submit:
-   blk_execute_rq_nowait(q, NULL, rq, 0, nvme_nvm_end_user_vio);
-
-   wait_for_completion_io(&wait);
+   blk_execute_rq(q, NULL, rq, 0);
 
if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
ret = -EINTR;
-- 
2.11.0



[PATCH 02/13] scsi/osd: don't save block errors into req_results

2017-06-03 Thread Christoph Hellwig
We will only have sense data if the command executed and got a SCSI
result, so this is pointless.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 drivers/scsi/osd/osd_initiator.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/osd/osd_initiator.c b/drivers/scsi/osd/osd_initiator.c
index 8a1b94816419..14785177ce7b 100644
--- a/drivers/scsi/osd/osd_initiator.c
+++ b/drivers/scsi/osd/osd_initiator.c
@@ -477,7 +477,7 @@ static void _set_error_resid(struct osd_request *or, struct 
request *req,
 int error)
 {
or->async_error = error;
-   or->req_errors = scsi_req(req)->result ? : error;
+   or->req_errors = scsi_req(req)->result;
or->sense_len = scsi_req(req)->sense_len;
if (or->sense_len)
memcpy(or->sense, scsi_req(req)->sense, or->sense_len);
-- 
2.11.0



dedicated error codes for the block layer V3

2017-06-03 Thread Christoph Hellwig
This series introduces a new blk_status_t error code type for the block
layer so that we can have tighter control and explicit semantics for
block layer errors.

All but the last three patches are cleanups that lead to the new type.

The series is mostly limited to the block layer and drivers, touching
file systems a little bit.  The only major exception is btrfs, which
does funny things with bios and thus sees a larger amount of propagation
of the new blk_status_t.

A git tree is also available at:

git://git.infradead.org/users/hch/block.git block-errors

gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-errors

Note that the two biggest patches didn't make it to linux-block and
linux-btrfs last time.  If you didn't get them they are available in
the git tree above.  Unfortunately there is no easy way to split them
up.

Changes since V2:
 - minor tweaks from reviews

Changes since V1: 
 - keep blk_types.h for now
 - removed a BUG_ON in dm-mpath
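
For readers who missed the earlier rounds, a minimal sketch of the idea
behind blk_status_t: a small dedicated status type with named values and
explicit conversion helpers at the boundaries.  The values and numbering
below are illustrative only, not the set the series actually introduces:

/* Illustrative sketch only; numbering is arbitrary here. */
typedef u8 blk_status_t;

#define BLK_STS_OK              0
#define BLK_STS_RESOURCE        ((blk_status_t)1)   /* resource shortage, retry later */
#define BLK_STS_IOERR           ((blk_status_t)2)   /* generic I/O error */

static inline int blk_status_to_errno_sketch(blk_status_t status)
{
        switch (status) {
        case BLK_STS_OK:
                return 0;
        case BLK_STS_RESOURCE:
                return -ENOMEM;
        case BLK_STS_IOERR:
        default:
                return -EIO;
        }
}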


[PATCH 03/13] gfs2: remove the unused sd_log_error field

2017-06-03 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
---
 fs/gfs2/incore.h | 1 -
 fs/gfs2/lops.c   | 4 +---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b7cf65d13561..aa3d44527fa2 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -815,7 +815,6 @@ struct gfs2_sbd {
atomic_t sd_log_in_flight;
struct bio *sd_log_bio;
wait_queue_head_t sd_log_flush_wait;
-   int sd_log_error;
 
atomic_t sd_reserving_log;
wait_queue_head_t sd_reserving_log_wait;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index b1f9144b42c7..13ebf15a4db0 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -209,10 +209,8 @@ static void gfs2_end_log_write(struct bio *bio)
struct page *page;
int i;
 
-   if (bio->bi_error) {
-   sdp->sd_log_error = bio->bi_error;
+   if (bio->bi_error)
fs_err(sdp, "Error %d writing to log\n", bio->bi_error);
-   }
 
bio_for_each_segment_all(bvec, bio, i) {
page = bvec->bv_page;
-- 
2.11.0



[PATCH 04/13] dm: fix REQ_RAHEAD handling

2017-06-03 Thread Christoph Hellwig
A few (but not all) dm targets use a special EWOULDBLOCK error code for
REQ_RAHEAD requests that fail due to a lack of available resources.
But no one else knows about this magic code, and lower level drivers also
don't generate it when failing read-ahead requests for similar reasons.

So remove this special casing and ignore all additional error handling for
REQ_RAHEAD - if there was a real underlying error we'd see it again
once the real (non-readahead) read comes in.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
---
 drivers/md/dm-raid1.c  | 4 ++--
 drivers/md/dm-stripe.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index a95cbb80fb34..5e30b08b91d9 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1214,7 +1214,7 @@ static int mirror_map(struct dm_target *ti, struct bio 
*bio)
 */
if (!r || (r == -EWOULDBLOCK)) {
if (bio->bi_opf & REQ_RAHEAD)
-   return -EWOULDBLOCK;
+   return -EIO;
 
queue_bio(ms, bio, rw);
return DM_MAPIO_SUBMITTED;
@@ -1258,7 +1258,7 @@ static int mirror_end_io(struct dm_target *ti, struct bio 
*bio, int error)
if (error == -EOPNOTSUPP)
return error;
 
-   if ((error == -EWOULDBLOCK) && (bio->bi_opf & REQ_RAHEAD))
+   if (bio->bi_opf & REQ_RAHEAD)
return error;
 
if (unlikely(error)) {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 75152482f3ad..780e95889a7c 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -384,7 +384,7 @@ static int stripe_end_io(struct dm_target *ti, struct bio 
*bio, int error)
if (!error)
return 0; /* I/O complete */
 
-   if ((error == -EWOULDBLOCK) && (bio->bi_opf & REQ_RAHEAD))
+   if (bio->bi_opf & REQ_RAHEAD)
return error;
 
if (error == -EOPNOTSUPP)
-- 
2.11.0



[PATCH 07/13] block_dev: propagate bio_iov_iter_get_pages error in __blkdev_direct_IO

2017-06-03 Thread Christoph Hellwig
Once we move the block layer to its own status code we'll still want to
propagate the bio_iov_iter_get_pages error, so restructure __blkdev_direct_IO
to take ret into account when returning the errno.

Signed-off-by: Christoph Hellwig 
---
 fs/block_dev.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..c1dc393ad6b9 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -334,7 +334,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bool is_read = (iov_iter_rw(iter) == READ), is_sync;
loff_t pos = iocb->ki_pos;
blk_qc_t qc = BLK_QC_T_NONE;
-   int ret;
+   int ret = 0;
 
if ((pos | iov_iter_alignment(iter)) &
(bdev_logical_block_size(bdev) - 1))
@@ -363,7 +363,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
 
ret = bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret)) {
-   bio->bi_error = ret;
+   bio->bi_error = -EIO;
bio_endio(bio);
break;
}
@@ -412,7 +412,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
}
__set_current_state(TASK_RUNNING);
 
-   ret = dio->bio.bi_error;
+   if (!ret)
+   ret = dio->bio.bi_error;
if (likely(!ret))
ret = dio->size;
 
-- 
2.11.0



[PATCH 08/13] dm mpath: merge do_end_io_bio into multipath_end_io_bio

2017-06-03 Thread Christoph Hellwig
This simplifies the code and especially the error passing a bit and
will help with the next patch.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-mpath.c | 42 +++---
 1 file changed, 15 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 3df056b73b66..6d5ebb76149d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1510,24 +1510,24 @@ static int multipath_end_io(struct dm_target *ti, 
struct request *clone,
return r;
 }
 
-static int do_end_io_bio(struct multipath *m, struct bio *clone,
-int error, struct dm_mpath_io *mpio)
+static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int 
error)
 {
+   struct multipath *m = ti->private;
+   struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
+   struct pgpath *pgpath = mpio->pgpath;
unsigned long flags;
 
-   if (!error)
-   return 0;   /* I/O complete */
-
-   if (noretry_error(error))
-   return error;
+   if (!error || noretry_error(error))
+   goto done;
 
-   if (mpio->pgpath)
-   fail_path(mpio->pgpath);
+   if (pgpath)
+   fail_path(pgpath);
 
if (atomic_read(&m->nr_valid_paths) == 0 &&
!test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
dm_report_EIO(m);
-   return -EIO;
+   error = -EIO;
+   goto done;
}
 
/* Queue for the daemon to resubmit */
@@ -1539,28 +1539,16 @@ static int do_end_io_bio(struct multipath *m, struct 
bio *clone,
if (!test_bit(MPATHF_QUEUE_IO, &m->flags))
queue_work(kmultipathd, &m->process_queued_bios);
 
-   return DM_ENDIO_INCOMPLETE;
-}
-
-static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int 
error)
-{
-   struct multipath *m = ti->private;
-   struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
-   struct pgpath *pgpath;
-   struct path_selector *ps;
-   int r;
-
-   BUG_ON(!mpio);
-
-   r = do_end_io_bio(m, clone, error, mpio);
-   pgpath = mpio->pgpath;
+   error = DM_ENDIO_INCOMPLETE;
+done:
if (pgpath) {
-   ps = &pgpath->pg->ps;
+   struct path_selector *ps = &pgpath->pg->ps;
+
if (ps->type->end_io)
ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
}
 
-   return r;
+   return error;
 }
 
 /*
-- 
2.11.0



[PATCH 09/13] dm: don't return errnos from ->map

2017-06-03 Thread Christoph Hellwig
Instead use the special DM_MAPIO_KILL return value to return -EIO just
like we do for the request based path.  Note that dm-log-writes returned
-ENOMEM in a few places, which now becomes -EIO instead.  No consumer
treats -ENOMEM specially, so this shouldn't be an issue (and it should
use a mempool to start with to make guaranteed progress).
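
As an illustration of the new convention (a hypothetical target, not code
taken from the series): ->map now signals failure with DM_MAPIO_KILL and
lets the dm core turn that into -EIO:

/* Hypothetical target ->map under the new convention. */
static int example_map(struct dm_target *ti, struct bio *bio)
{
        if (!example_can_handle(ti, bio))       /* hypothetical check */
                return DM_MAPIO_KILL;           /* dm core completes the bio with -EIO */

        bio->bi_bdev = example_dest_bdev(ti);   /* hypothetical remap */
        return DM_MAPIO_REMAPPED;
}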

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-crypt.c |  4 ++--
 drivers/md/dm-flakey.c|  4 ++--
 drivers/md/dm-integrity.c | 12 ++--
 drivers/md/dm-log-writes.c|  4 ++--
 drivers/md/dm-mpath.c | 13 ++---
 drivers/md/dm-raid1.c |  6 +++---
 drivers/md/dm-snap.c  |  8 
 drivers/md/dm-target.c|  2 +-
 drivers/md/dm-verity-target.c |  6 +++---
 drivers/md/dm-zero.c  |  4 ++--
 drivers/md/dm.c   | 16 +++-
 11 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ebf9e72d479b..f4b51809db21 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2795,10 +2795,10 @@ static int crypt_map(struct dm_target *ti, struct bio 
*bio)
 * and is aligned to this size as defined in IO hints.
 */
if (unlikely((bio->bi_iter.bi_sector & ((cc->sector_size >> 
SECTOR_SHIFT) - 1)) != 0))
-   return -EIO;
+   return DM_MAPIO_KILL;
 
if (unlikely(bio->bi_iter.bi_size & (cc->sector_size - 1)))
-   return -EIO;
+   return DM_MAPIO_KILL;
 
io = dm_per_bio_data(bio, cc->per_bio_data_size);
crypt_io_init(io, cc, bio, dm_target_offset(ti, 
bio->bi_iter.bi_sector));
diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index 13305a182611..e8f093b323ce 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -321,7 +321,7 @@ static int flakey_map(struct dm_target *ti, struct bio *bio)
if (bio_data_dir(bio) == READ) {
if (!fc->corrupt_bio_byte && !test_bit(DROP_WRITES, 
&fc->flags) &&
!test_bit(ERROR_WRITES, &fc->flags))
-   return -EIO;
+   return DM_MAPIO_KILL;
goto map_bio;
}
 
@@ -349,7 +349,7 @@ static int flakey_map(struct dm_target *ti, struct bio *bio)
/*
 * By default, error all I/O.
 */
-   return -EIO;
+   return DM_MAPIO_KILL;
}
 
 map_bio:
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index c7f7c8d76576..ee78fb471229 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -1352,13 +1352,13 @@ static int dm_integrity_map(struct dm_target *ti, 
struct bio *bio)
DMERR("Too big sector number: 0x%llx + 0x%x > 0x%llx",
  (unsigned long long)dio->range.logical_sector, 
bio_sectors(bio),
  (unsigned long long)ic->provided_data_sectors);
-   return -EIO;
+   return DM_MAPIO_KILL;
}
if (unlikely((dio->range.logical_sector | bio_sectors(bio)) & 
(unsigned)(ic->sectors_per_block - 1))) {
DMERR("Bio not aligned on %u sectors: 0x%llx, 0x%x",
  ic->sectors_per_block,
  (unsigned long long)dio->range.logical_sector, 
bio_sectors(bio));
-   return -EIO;
+   return DM_MAPIO_KILL;
}
 
if (ic->sectors_per_block > 1) {
@@ -1368,7 +1368,7 @@ static int dm_integrity_map(struct dm_target *ti, struct 
bio *bio)
if (unlikely((bv.bv_offset | bv.bv_len) & 
((ic->sectors_per_block << SECTOR_SHIFT) - 1))) {
DMERR("Bio vector (%u,%u) is not aligned on 
%u-sector boundary",
bv.bv_offset, bv.bv_len, 
ic->sectors_per_block);
-   return -EIO;
+   return DM_MAPIO_KILL;
}
}
}
@@ -1383,18 +1383,18 @@ static int dm_integrity_map(struct dm_target *ti, 
struct bio *bio)
wanted_tag_size *= ic->tag_size;
if (unlikely(wanted_tag_size != bip->bip_iter.bi_size)) 
{
DMERR("Invalid integrity data size %u, expected 
%u", bip->bip_iter.bi_size, wanted_tag_size);
-   return -EIO;
+   return DM_MAPIO_KILL;
}
}
} else {
if (unlikely(bip != NULL)) {
DMERR("Unexpected integrity data when using internal 
hash");
-   return -EIO;
+   return DM_MAPIO_KILL;
}
}
 
if (unlikely(ic->mode == 'R') && unlikely(dio->write))
-   return -EIO;
+   return DM_MAPIO_KILL;
 

[PATCH 05/13] fs: remove the unused error argument to dio_end_io()

2017-06-03 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
---
 fs/btrfs/inode.c   | 6 +++---
 fs/direct-io.c | 3 +--
 include/linux/fs.h | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 17cbe9306faf..758b2666885e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8244,7 +8244,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
kfree(dip);
 
dio_bio->bi_error = bio->bi_error;
-   dio_end_io(dio_bio, bio->bi_error);
+   dio_end_io(dio_bio);
 
if (io_bio->end_io)
io_bio->end_io(io_bio, err);
@@ -8304,7 +8304,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
kfree(dip);
 
dio_bio->bi_error = bio->bi_error;
-   dio_end_io(dio_bio, bio->bi_error);
+   dio_end_io(dio_bio);
bio_put(bio);
 }
 
@@ -8673,7 +8673,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
 * Releases and cleans up our dio_bio, no need to bio_put()
 * nor bio_endio()/bio_io_error() against dio_bio.
 */
-   dio_end_io(dio_bio, ret);
+   dio_end_io(dio_bio);
}
if (io_bio)
bio_put(io_bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..04247a6c3f73 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -348,13 +348,12 @@ static void dio_bio_end_io(struct bio *bio)
 /**
  * dio_end_io - handle the end io action for the given bio
  * @bio: The direct io bio thats being completed
- * @error: Error if there was one
  *
  * This is meant to be called by any filesystem that uses their own 
dio_submit_t
  * so that the DIO specific endio actions are dealt with after the filesystem
  * has done it's completion work.
  */
-void dio_end_io(struct bio *bio, int error)
+void dio_end_io(struct bio *bio)
 {
struct dio *dio = bio->bi_private;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..4388ab58843d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2843,7 +2843,7 @@ enum {
DIO_SKIP_DIO_COUNT = 0x08,
 };
 
-void dio_end_io(struct bio *bio, int error);
+void dio_end_io(struct bio *bio);
 
 ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 struct block_device *bdev, struct iov_iter *iter,
-- 
2.11.0



[PATCH 06/13] fs: simplify dio_bio_complete

2017-06-03 Thread Christoph Hellwig
Only read bio->bi_error once in the common path.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
---
 fs/direct-io.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 04247a6c3f73..bb711e4b86c2 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -477,13 +477,12 @@ static int dio_bio_complete(struct dio *dio, struct bio 
*bio)
 {
struct bio_vec *bvec;
unsigned i;
-   int err;
+   int err = bio->bi_error;
 
-   if (bio->bi_error)
+   if (err)
dio->io_error = -EIO;
 
if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
-   err = bio->bi_error;
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
bio_for_each_segment_all(bvec, bio, i) {
@@ -494,7 +493,6 @@ static int dio_bio_complete(struct dio *dio, struct bio 
*bio)
set_page_dirty_lock(page);
put_page(page);
}
-   err = bio->bi_error;
bio_put(bio);
}
return err;
-- 
2.11.0



[PATCH 10/13] dm: change ->end_io calling convention

2017-06-03 Thread Christoph Hellwig
Turn the error parameter into a pointer so that target drivers can change
the value, and make sure only DM_ENDIO_* values are returned from the
methods.
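
As a quick illustration of the new convention (a hypothetical target, not
code from the series): the error now arrives by reference and may be
rewritten, while the return value is restricted to the DM_ENDIO_* codes:

/* Hypothetical target ->end_io under the new calling convention. */
static int example_end_io(struct dm_target *ti, struct bio *bio, int *error)
{
        /* a target may rewrite the error it wants the core to see ... */
        if (*error == -EOPNOTSUPP)
                *error = 0;

        /* ... but must return one of the DM_ENDIO_* codes, not an errno */
        return DM_ENDIO_DONE;
}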

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-cache-target.c  |  4 ++--
 drivers/md/dm-flakey.c|  8 
 drivers/md/dm-log-writes.c|  4 ++--
 drivers/md/dm-mpath.c | 11 ++-
 drivers/md/dm-raid1.c | 14 +++---
 drivers/md/dm-snap.c  |  4 ++--
 drivers/md/dm-stripe.c| 14 +++---
 drivers/md/dm-thin.c  |  4 ++--
 drivers/md/dm.c   | 36 ++--
 include/linux/device-mapper.h |  2 +-
 10 files changed, 51 insertions(+), 50 deletions(-)

diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index d682a0511381..c48612e6d525 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -2820,7 +2820,7 @@ static int cache_map(struct dm_target *ti, struct bio 
*bio)
return r;
 }
 
-static int cache_end_io(struct dm_target *ti, struct bio *bio, int error)
+static int cache_end_io(struct dm_target *ti, struct bio *bio, int *error)
 {
struct cache *cache = ti->private;
unsigned long flags;
@@ -2838,7 +2838,7 @@ static int cache_end_io(struct dm_target *ti, struct bio 
*bio, int error)
bio_drop_shared_lock(cache, bio);
accounted_complete(cache, bio);
 
-   return 0;
+   return DM_ENDIO_DONE;
 }
 
 static int write_dirty_bitset(struct cache *cache)
diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index e8f093b323ce..c9539917a59b 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -358,12 +358,12 @@ static int flakey_map(struct dm_target *ti, struct bio 
*bio)
return DM_MAPIO_REMAPPED;
 }
 
-static int flakey_end_io(struct dm_target *ti, struct bio *bio, int error)
+static int flakey_end_io(struct dm_target *ti, struct bio *bio, int *error)
 {
struct flakey_c *fc = ti->private;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct 
per_bio_data));
 
-   if (!error && pb->bio_submitted && (bio_data_dir(bio) == READ)) {
+   if (!*error && pb->bio_submitted && (bio_data_dir(bio) == READ)) {
if (fc->corrupt_bio_byte && (fc->corrupt_bio_rw == READ) &&
all_corrupt_bio_flags_match(bio, fc)) {
/*
@@ -377,11 +377,11 @@ static int flakey_end_io(struct dm_target *ti, struct bio 
*bio, int error)
 * Error read during the down_interval if drop_writes
 * and error_writes were not configured.
 */
-   return -EIO;
+   *error = -EIO;
}
}
 
-   return error;
+   return DM_ENDIO_DONE;
 }
 
 static void flakey_status(struct dm_target *ti, status_type_t type,
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index e42264706c59..cc57c7fa1268 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -664,7 +664,7 @@ static int log_writes_map(struct dm_target *ti, struct bio 
*bio)
return DM_MAPIO_REMAPPED;
 }
 
-static int normal_end_io(struct dm_target *ti, struct bio *bio, int error)
+static int normal_end_io(struct dm_target *ti, struct bio *bio, int *error)
 {
struct log_writes_c *lc = ti->private;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct 
per_bio_data));
@@ -686,7 +686,7 @@ static int normal_end_io(struct dm_target *ti, struct bio 
*bio, int error)
spin_unlock_irqrestore(&lc->blocks_lock, flags);
}
 
-   return error;
+   return DM_ENDIO_DONE;
 }
 
 /*
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bf6e49c780d5..ceeeb495d01c 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1517,14 +1517,15 @@ static int multipath_end_io(struct dm_target *ti, 
struct request *clone,
return r;
 }
 
-static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int 
error)
+static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int 
*error)
 {
struct multipath *m = ti->private;
struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
struct pgpath *pgpath = mpio->pgpath;
unsigned long flags;
+   int r = DM_ENDIO_DONE;
 
-   if (!error || noretry_error(error))
+   if (!*error || noretry_error(*error))
goto done;
 
if (pgpath)
@@ -1533,7 +1534,7 @@ static int multipath_end_io_bio(struct dm_target *ti, 
struct bio *clone, int err
if (atomic_read(&m->nr_valid_paths) == 0 &&
!test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
dm_report_EIO(m);
-   error = -EIO;
+   *error = -EIO;
goto done;
}
 
@@ -1546,7 +1547,7 @@ static int multipath_end_io_bio(struct dm_target *ti, 
struct bio *clone, int err
if (!test_b

[PATCH 12/13] blk-mq: switch ->queue_rq return value to blk_status_t

2017-06-03 Thread Christoph Hellwig
Use the same values for request completion errors as the return
value from ->queue_rq.  BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.
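
For driver writers the change looks roughly like the sketch below; the
example_hw_* helpers are hypothetical stand-ins for real hardware
submission code:

/* Sketch of a ->queue_rq implementation under the new return type. */
static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        if (!example_hw_has_room(hctx->driver_data))    /* hypothetical */
                return BLK_STS_RESOURCE;        /* core requeues the request */

        blk_mq_start_request(rq);
        if (example_hw_submit(hctx->driver_data, rq))   /* hypothetical */
                return BLK_STS_IOERR;           /* completed as an error */

        return BLK_STS_OK;
}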

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c| 37 --
 drivers/block/loop.c  |  6 +++---
 drivers/block/mtip32xx/mtip32xx.c | 17 
 drivers/block/nbd.c   | 12 ---
 drivers/block/null_blk.c  |  4 ++--
 drivers/block/rbd.c   |  4 ++--
 drivers/block/virtio_blk.c| 10 +-
 drivers/block/xen-blkfront.c  |  8 
 drivers/md/dm-rq.c|  8 
 drivers/mtd/ubi/block.c   |  6 +++---
 drivers/nvme/host/core.c  | 14 ++---
 drivers/nvme/host/fc.c| 23 +++--
 drivers/nvme/host/nvme.h  |  2 +-
 drivers/nvme/host/pci.c   | 42 +++
 drivers/nvme/host/rdma.c  | 26 +---
 drivers/nvme/target/loop.c| 17 
 drivers/scsi/scsi_lib.c   | 30 ++--
 include/linux/blk-mq.h|  7 ++-
 18 files changed, 131 insertions(+), 142 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index adcc1c0dce6e..7af78b1e9db9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -924,7 +924,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
 {
struct blk_mq_hw_ctx *hctx;
struct request *rq;
-   int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK;
+   int errors, queued;
 
if (list_empty(list))
return false;
@@ -935,6 +935,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
+   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -975,25 +976,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   switch (ret) {
-   case BLK_MQ_RQ_QUEUE_OK:
-   queued++;
-   break;
-   case BLK_MQ_RQ_QUEUE_BUSY:
+   if (ret == BLK_STS_RESOURCE) {
blk_mq_put_driver_tag_hctx(hctx, rq);
list_add(&rq->queuelist, list);
__blk_mq_requeue_request(rq);
break;
-   default:
-   pr_err("blk-mq: bad return on queue: %d\n", ret);
-   case BLK_MQ_RQ_QUEUE_ERROR:
+   }
+
+   if (unlikely(ret != BLK_STS_OK)) {
errors++;
blk_mq_end_request(rq, BLK_STS_IOERR);
-   break;
+   continue;
}
 
-   if (ret == BLK_MQ_RQ_QUEUE_BUSY)
-   break;
+   queued++;
} while (!list_empty(list));
 
hctx->dispatched[queued_to_index(queued)]++;
@@ -1031,7 +1027,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
 * - blk_mq_run_hw_queue() checks whether or not a queue has
 *   been stopped before rerunning a queue.
 * - Some but not all block drivers stop a queue before
-*   returning BLK_MQ_RQ_QUEUE_BUSY. Two exceptions are scsi-mq
+*   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 *   and dm-rq.
 */
if (!blk_mq_sched_needs_restart(hctx) &&
@@ -1410,7 +1406,7 @@ static void __blk_mq_try_issue_directly(struct request 
*rq, blk_qc_t *cookie,
};
struct blk_mq_hw_ctx *hctx;
blk_qc_t new_cookie;
-   int ret;
+   blk_status_t ret;
 
if (q->elevator)
goto insert;
@@ -1426,18 +1422,19 @@ static void __blk_mq_try_issue_directly(struct request 
*rq, blk_qc_t *cookie,
 * would have done
 */
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_MQ_RQ_QUEUE_OK) {
+   switch (ret) {
+   case BLK_STS_OK:
*cookie = new_cookie;
return;
-   }
-
-   if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
+   case BLK_STS_RESOURCE:
+   __blk_mq_requeue_request(rq);
+   goto insert;
+   default:
*cookie = BLK_QC_T_NONE;
-   blk_mq_end_request(rq, BLK_STS_IOERR);
+   blk_mq_end_request(rq, ret);
return;
}
 
-   __blk_mq_requeue_request(rq);
 insert:
blk_mq_sched_insert_request(rq, false, true, false, may_sleep);
 }
diff --git a/drivers/block/loop.c b/drivers/block/loop.c

Re: Block device integrity support

2017-06-03 Thread Jens Axboe
On 06/03/2017 12:32 AM, Christoph Hellwig wrote:
> On Fri, Jun 02, 2017 at 03:52:57PM -0700, Jens Axboe wrote:
>> On 06/02/2017 03:38 PM, Bart Van Assche wrote:
>>> On Fri, 2017-06-02 at 18:24 -0400, Martin K. Petersen wrote:
> then the output shown below appears in the kernel log. Does anyone know 
> how
> to fix this? Sorry but I'm not really familiar with the integrity
> code.

 Dmitry posted a fix for this a few weeks ago.
>>>
>>> Ah, that's right. Jens, had you noticed this message:
>>> https://www.spinics.net/lists/fstests/msg06214.html?
>>
>> No, unfortunately not. I'll queue it up for 4.13 and mark it
>> stable for 4.12.
> 
> Can we please get it into 4.12-rc?

Yes of course, that's actually what I did. I just got the version
numbers mixed up.

-- 
Jens Axboe



[PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper

2017-06-03 Thread Christoph Hellwig
Factor out code from the x86 CPU hotplug path to program the affinity
for a vector on a hot plug / hot unplug event.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/kernel/irq.c | 23 ++-
 include/linux/interrupt.h |  1 +
 kernel/irq/affinity.c | 28 
 3 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index f34fe7444836..a54eac5d81b3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -437,7 +437,6 @@ void fixup_irqs(void)
struct irq_desc *desc;
struct irq_data *data;
struct irq_chip *chip;
-   int ret;
 
for_each_irq_desc(irq, desc) {
int break_affinity = 0;
@@ -482,26 +481,8 @@ void fixup_irqs(void)
continue;
}
 
-   if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
-   chip->irq_mask(data);
-
-   if (chip->irq_set_affinity) {
-   ret = chip->irq_set_affinity(data, affinity, true);
-   if (ret == -ENOSPC)
-   pr_crit("IRQ %d set affinity failed because 
there are no available vectors.  The device assigned to this IRQ is 
unstable.\n", irq);
-   } else {
-   if (!(warned++))
-   set_affinity = 0;
-   }
-
-   /*
-* We unmask if the irq was not marked masked by the
-* core code. That respects the lazy irq disable
-* behaviour.
-*/
-   if (!irqd_can_move_in_process_context(data) &&
-   !irqd_irq_masked(data) && chip->irq_unmask)
-   chip->irq_unmask(data);
+   if (!irq_affinity_set(irq, desc, affinity) && !warned++)
+   set_affinity = 0;
 
raw_spin_unlock(&desc->lock);
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a6fba4804672..afd3aa33e9b0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -292,6 +292,7 @@ irq_set_affinity_notifier(unsigned int irq, struct 
irq_affinity_notify *notify);
 
 struct cpumask *irq_create_affinity_masks(int nvec, const struct irq_affinity 
*affd);
 int irq_calc_affinity_vectors(int maxvec, const struct irq_affinity *affd);
+bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask);
 
 #else /* CONFIG_SMP */
 
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e2d356dd7581..3cec0042fad2 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,8 +1,36 @@
 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include "internals.h"
+
+bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
+{
+   struct irq_data *data = irq_desc_get_irq_data(desc);
+   struct irq_chip *chip = irq_data_get_irq_chip(data);
+   bool ret = false;
+
+   if (!irq_can_move_pcntxt(data) && chip->irq_mask)
+   chip->irq_mask(data);
+
+   if (chip->irq_set_affinity) {
+   if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
+   pr_crit("IRQ %d set affinity failed because there are 
no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
+   ret = true;
+   }
+
+   /*
+* We unmask if the irq was not marked masked by the core code.
+* That respects the lazy irq disable behaviour.
+*/
+   if (!irq_can_move_pcntxt(data) &&
+   !irqd_irq_masked(data) && chip->irq_unmask)
+   chip->irq_unmask(data);
+
+   return ret;
+}
 
 static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
int cpus_per_vec)
-- 
2.11.0



[PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events

2017-06-03 Thread Christoph Hellwig
Remove a CPU from the affinity mask when it goes offline and add it
back when it returns.  In case the vector was assigned only to the CPU
going offline it will be shut down and restarted when the CPU
reappears.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/kernel/irq.c  |   3 +-
 include/linux/cpuhotplug.h |   1 +
 include/linux/irq.h|   9 
 kernel/cpu.c   |   6 +++
 kernel/irq/affinity.c  | 127 -
 5 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index a54eac5d81b3..72c35ed534f1 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -453,7 +453,8 @@ void fixup_irqs(void)
 
data = irq_desc_get_irq_data(desc);
affinity = irq_data_get_affinity_mask(data);
-   if (!irq_has_action(irq) || irqd_is_per_cpu(data) ||
+   if (irqd_affinity_is_managed(data) ||
+   !irq_has_action(irq) || irqd_is_per_cpu(data) ||
cpumask_subset(affinity, cpu_online_mask)) {
raw_spin_unlock(&desc->lock);
continue;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 0f2a80377520..c15f22c54535 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -124,6 +124,7 @@ enum cpuhp_state {
CPUHP_AP_ONLINE_IDLE,
CPUHP_AP_SMPBOOT_THREADS,
CPUHP_AP_X86_VDSO_VMA_ONLINE,
+   CPUHP_AP_IRQ_AFFINITY_ONLINE,
CPUHP_AP_PERF_ONLINE,
CPUHP_AP_PERF_X86_ONLINE,
CPUHP_AP_PERF_X86_UNCORE_ONLINE,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f887351aa80e..ae15b8582685 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -216,6 +216,7 @@ enum {
IRQD_WAKEUP_ARMED   = (1 << 19),
IRQD_FORWARDED_TO_VCPU  = (1 << 20),
IRQD_AFFINITY_MANAGED   = (1 << 21),
+   IRQD_AFFINITY_SUSPENDED = (1 << 22),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -329,6 +330,11 @@ static inline void irqd_clr_activated(struct irq_data *d)
__irqd_to_state(d) &= ~IRQD_ACTIVATED;
 }
 
+static inline bool irqd_affinity_is_suspended(struct irq_data *d)
+{
+   return __irqd_to_state(d) & IRQD_AFFINITY_SUSPENDED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
@@ -1025,4 +1031,7 @@ int __ipi_send_mask(struct irq_desc *desc, const struct 
cpumask *dest);
 int ipi_send_single(unsigned int virq, unsigned int cpu);
 int ipi_send_mask(unsigned int virq, const struct cpumask *dest);
 
+int irq_affinity_online_cpu(unsigned int cpu);
+int irq_affinity_offline_cpu(unsigned int cpu);
+
 #endif /* _LINUX_IRQ_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9ae6fbe5b5cf..ef0c5b63ca0d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #define CREATE_TRACE_POINTS
@@ -1252,6 +1253,11 @@ static struct cpuhp_step cpuhp_ap_states[] = {
.startup.single = smpboot_unpark_threads,
.teardown.single= NULL,
},
+   [CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
+   .name   = "irq/affinity:online",
+   .startup.single = irq_affinity_online_cpu,
+   .teardown.single= irq_affinity_offline_cpu,
+   },
[CPUHP_AP_PERF_ONLINE] = {
.name   = "perf:online",
.startup.single = perf_event_init_cpu,
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 337e6ffba93f..e27ecfb4866f 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,4 +1,7 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include 
 #include 
 #include 
@@ -227,3 +230,125 @@ int irq_calc_affinity_vectors(int maxvec, const struct 
irq_affinity *affd)
 
return min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 }
+
+static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
+   unsigned int cpu)
+{
+   const struct cpumask *affinity;
+   struct irq_data *data;
+   struct irq_chip *chip;
+   unsigned long flags;
+   cpumask_var_t mask;
+
+   if (!desc)
+   return;
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+   return;
+
+   raw_spin_lock_irqsave(&desc->lock, flags);
+
+   data = irq_desc_get_irq_data(desc);
+   affinity = irq_data_get_affinity_mask(data);
+   if (!irqd_affinity_is_managed(data) ||
+   !irq_has_action(irq) ||
+   !cpumask_test_cpu(cpu, affinity))
+   goto out_free_cpumask;
+
+   /*
+* The interrupt descriptor might have been cleaned up
+* already, but it is not yet removed from the 

spread MSI(-X) vectors to all possible CPUs V2

2017-06-03 Thread Christoph Hellwig
Hi all,

this series changes our automatic MSI-X vector assignment so that it
takes all present CPUs into account instead of all online ones.  This
allows us to better deal with CPU hotplug events, which could happen
frequently due to power management for example.
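
For reference, drivers opt into the automatic spreading through
pci_alloc_irq_vectors_affinity(); below is a minimal, hypothetical example
that reserves the first vector (e.g. for an admin interrupt) and spreads
the rest:

/* Hypothetical driver snippet: request spread MSI-X vectors. */
static const struct irq_affinity example_affd = {
        .pre_vectors = 1,       /* keep vector 0 out of the spread set */
};

static int example_setup_irqs(struct pci_dev *pdev, unsigned int max_queues)
{
        int nvecs;

        nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, max_queues + 1,
                                               PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                               &example_affd);
        if (nvecs < 0)
                return nvecs;

        return nvecs - 1;       /* number of spread queue vectors */
}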

Changes since V1:
 - rebase to current Linus' tree
 - add irq_lock_sparse calls
 - move memory allocations outside of (raw) spinlocks
 - make the possible cpus per node mask safe vs physical CPU hotplug
 - remove the irq_force_complete_move call
 - factor some common code into helpers
 - indentation fixups


[PATCH 8/8] nvme: allocate queues for all possible CPUs

2017-06-03 Thread Christoph Hellwig
Unlike most drivers that simply pass the maximum possible vectors to
pci_alloc_irq_vectors, NVMe needs to configure the device before allocating
the vectors, so it needs a manual update for the new scheme of using
all present CPUs.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d52701df7245..4152d93fbbef 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1525,7 +1525,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
struct pci_dev *pdev = to_pci_dev(dev->dev);
int result, nr_io_queues, size;
 
-   nr_io_queues = num_online_cpus();
+   nr_io_queues = num_present_cpus();
result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
if (result < 0)
return result;
-- 
2.11.0



[PATCH 4/8] genirq/affinity: assign vectors to all present CPUs

2017-06-03 Thread Christoph Hellwig
Currently we only assign spread vectors to online CPUs, which ties the
IRQ mapping to the currently online devices and doesn't deal nicely with
the fact that CPUs could come and go rapidly due to e.g. power management.

Instead assign vectors to all present CPUs to avoid this churn.

For this we have to build a map of all possible CPUs for a given node, as
the architectures only provide a map of all online CPUs.  We do this
dynamically on each call for the vector assignments, which is a bit
suboptimal and could be optimized in the future by providing a mapping
from the arch code.

Signed-off-by: Christoph Hellwig 
---
 kernel/irq/affinity.c | 71 +--
 1 file changed, 57 insertions(+), 14 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 3cec0042fad2..337e6ffba93f 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -63,13 +63,54 @@ static void irq_spread_init_one(struct cpumask *irqmsk, 
struct cpumask *nmsk,
}
 }
 
-static int get_nodes_in_cpumask(const struct cpumask *mask, nodemask_t 
*nodemsk)
+static cpumask_var_t *alloc_node_to_present_cpumask(void)
+{
+   int node;
+   cpumask_var_t *masks;
+
+   masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+   if (!masks)
+   return NULL;
+
+   for (node = 0; node < nr_node_ids; node++) {
+   if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
+   goto out_unwind;
+   }
+
+   return masks;
+
+out_unwind:
+   while (--node >= 0)
+   free_cpumask_var(masks[node]);
+   kfree(masks);
+   return NULL;
+}
+
+static void free_node_to_present_cpumask(cpumask_var_t *masks)
+{
+   int node;
+
+   for (node = 0; node < nr_node_ids; node++)
+   free_cpumask_var(masks[node]);
+   kfree(masks);
+}
+
+static void build_node_to_present_cpumask(cpumask_var_t *masks)
+{
+   int cpu;
+
+   for_each_present_cpu(cpu)
+   cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
+}
+
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
+   const struct cpumask *mask, nodemask_t *nodemsk)
 {
int n, nodes = 0;
 
/* Calculate the number of nodes in the supplied affinity mask */
-   for_each_online_node(n) {
-   if (cpumask_intersects(mask, cpumask_of_node(n))) {
+   for_each_node(n) {
+   if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
@@ -92,7 +133,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
struct cpumask *masks;
-   cpumask_var_t nmsk;
+   cpumask_var_t nmsk, *node_to_present_cpumask;
 
if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
return NULL;
@@ -101,13 +142,19 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (!masks)
goto out;
 
+   node_to_present_cpumask = alloc_node_to_present_cpumask();
+   if (!node_to_present_cpumask)
+   goto out;
+
/* Fill out vectors at the beginning that don't need affinity */
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
 
/* Stabilize the cpumasks */
get_online_cpus();
-   nodes = get_nodes_in_cpumask(cpu_online_mask, &nodemsk);
+   build_node_to_present_cpumask(node_to_present_cpumask);
+   nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
+   &nodemsk);
 
/*
 * If the number of nodes in the mask is greater than or equal the
@@ -115,7 +162,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
 */
if (affv <= nodes) {
for_each_node_mask(n, nodemsk) {
-   cpumask_copy(masks + curvec, cpumask_of_node(n));
+   cpumask_copy(masks + curvec,
+node_to_present_cpumask[n]);
if (++curvec == last_affv)
break;
}
@@ -129,7 +177,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
/* Get the cpus on this node which are in the mask */
-   cpumask_and(nmsk, cpu_online_mask, cpumask_of_node(n));
+   cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
 
/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -161,6 +209,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Fill out v

[PATCH 7/8] blk-mq: create hctx for each present CPU

2017-06-03 Thread Christoph Hellwig
Currently we only create hctx for online CPUs, which can lead to a lot
of churn due to frequent soft offline / online operations.  Instead
allocate one for each present CPU to avoid this and dramatically simplify
the code.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c | 120 +
 block/blk-mq.h |   5 --
 include/linux/cpuhotplug.h |   1 -
 3 files changed, 11 insertions(+), 115 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1bcccedcc74f..66ca9a090984 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -37,9 +37,6 @@
 #include "blk-wbt.h"
 #include "blk-mq-sched.h"
 
-static DEFINE_MUTEX(all_q_mutex);
-static LIST_HEAD(all_q_list);
-
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync);
@@ -1966,8 +1963,8 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
INIT_LIST_HEAD(&__ctx->rq_list);
__ctx->queue = q;
 
-   /* If the cpu isn't online, the cpu is mapped to first hctx */
-   if (!cpu_online(i))
+   /* If the cpu isn't present, the cpu is mapped to first hctx */
+   if (!cpu_present(i))
continue;
 
hctx = blk_mq_map_queue(q, i);
@@ -2010,8 +2007,7 @@ static void blk_mq_free_map_and_requests(struct 
blk_mq_tag_set *set,
}
 }
 
-static void blk_mq_map_swqueue(struct request_queue *q,
-  const struct cpumask *online_mask)
+static void blk_mq_map_swqueue(struct request_queue *q)
 {
unsigned int i, hctx_idx;
struct blk_mq_hw_ctx *hctx;
@@ -2029,13 +2025,11 @@ static void blk_mq_map_swqueue(struct request_queue *q,
}
 
/*
-* Map software to hardware queues
+* Map software to hardware queues.
+*
+* If the cpu isn't present, the cpu is mapped to first hctx.
 */
-   for_each_possible_cpu(i) {
-   /* If the cpu isn't online, the cpu is mapped to first hctx */
-   if (!cpumask_test_cpu(i, online_mask))
-   continue;
-
+   for_each_present_cpu(i) {
hctx_idx = q->mq_map[i];
/* unmapped hw queue can be remapped after CPU topo changed */
if (!set->tags[hctx_idx] &&
@@ -2321,16 +2315,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
blk_queue_softirq_done(q, set->ops->complete);
 
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
-
-   get_online_cpus();
-   mutex_lock(&all_q_mutex);
-
-   list_add_tail(&q->all_q_node, &all_q_list);
blk_mq_add_queue_tag_set(set, q);
-   blk_mq_map_swqueue(q, cpu_online_mask);
-
-   mutex_unlock(&all_q_mutex);
-   put_online_cpus();
+   blk_mq_map_swqueue(q);
 
if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
int ret;
@@ -2356,18 +2342,12 @@ void blk_mq_free_queue(struct request_queue *q)
 {
struct blk_mq_tag_set   *set = q->tag_set;
 
-   mutex_lock(&all_q_mutex);
-   list_del_init(&q->all_q_node);
-   mutex_unlock(&all_q_mutex);
-
blk_mq_del_queue_tag_set(q);
-
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-static void blk_mq_queue_reinit(struct request_queue *q,
-   const struct cpumask *online_mask)
+static void blk_mq_queue_reinit(struct request_queue *q)
 {
WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
 
@@ -2380,76 +2360,12 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 * involves free and re-allocate memory, worthy doing?)
 */
 
-   blk_mq_map_swqueue(q, online_mask);
+   blk_mq_map_swqueue(q);
 
blk_mq_sysfs_register(q);
blk_mq_debugfs_register_hctxs(q);
 }
 
-/*
- * New online cpumask which is going to be set in this hotplug event.
- * Declare this cpumasks as global as cpu-hotplug operation is invoked
- * one-by-one and dynamically allocating this could result in a failure.
- */
-static struct cpumask cpuhp_online_new;
-
-static void blk_mq_queue_reinit_work(void)
-{
-   struct request_queue *q;
-
-   mutex_lock(&all_q_mutex);
-   /*
-* We need to freeze and reinit all existing queues.  Freezing
-* involves synchronous wait for an RCU grace period and doing it
-* one by one may take a long time.  Start freezing all queues in
-* one swoop and then wait for the completions so that freezing can
-* take place in parallel.
-*/
-   list_for_each_entry(q, &all_q_list, all_q_node)
-   blk_freeze_queue_start(q);
-   list_for_each_entry(q, &all_q_list, all_q_node)
-   blk_mq_freeze_queue_wait(q);
-
-   list_for_each_entry(q, &al

[PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs

2017-06-03 Thread Christoph Hellwig
This will allow us to spread MSI/MSI-X affinity over all present CPUs and
thus better deal with systems where CPUs are taken on- and offline all the
time.

Signed-off-by: Christoph Hellwig 
---
 kernel/irq/manage.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 070be980c37a..5c25d4a5dc46 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -361,17 +361,17 @@ static int setup_affinity(struct irq_desc *desc, struct 
cpumask *mask)
if (irqd_affinity_is_managed(&desc->irq_data) ||
irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
if (cpumask_intersects(desc->irq_common_data.affinity,
-  cpu_online_mask))
+  cpu_present_mask))
set = desc->irq_common_data.affinity;
else
irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET);
}
 
-   cpumask_and(mask, cpu_online_mask, set);
+   cpumask_and(mask, cpu_present_mask, set);
if (node != NUMA_NO_NODE) {
const struct cpumask *nodemask = cpumask_of_node(node);
 
-   /* make sure at least one of the cpus in nodemask is online */
+   /* make sure at least one of the cpus in nodemask is present */
if (cpumask_intersects(mask, nodemask))
cpumask_and(mask, mask, nodemask);
}
-- 
2.11.0



[PATCH 2/8] genirq: move pending helpers to internal.h

2017-06-03 Thread Christoph Hellwig
So that the affinity code can reuse them.

Signed-off-by: Christoph Hellwig 
---
 kernel/irq/internals.h | 38 ++
 kernel/irq/manage.c| 28 
 2 files changed, 38 insertions(+), 28 deletions(-)

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index bc226e783bd2..b81f6ce73a68 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -226,3 +226,41 @@ irq_pm_install_action(struct irq_desc *desc, struct 
irqaction *action) { }
 static inline void
 irq_pm_remove_action(struct irq_desc *desc, struct irqaction *action) { }
 #endif
+
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+static inline bool irq_can_move_pcntxt(struct irq_data *data)
+{
+   return irqd_can_move_in_process_context(data);
+}
+static inline bool irq_move_pending(struct irq_data *data)
+{
+   return irqd_is_setaffinity_pending(data);
+}
+static inline void
+irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask)
+{
+   cpumask_copy(desc->pending_mask, mask);
+}
+static inline void
+irq_get_pending(struct cpumask *mask, struct irq_desc *desc)
+{
+   cpumask_copy(mask, desc->pending_mask);
+}
+#else /* CONFIG_GENERIC_PENDING_IRQ */
+static inline bool irq_can_move_pcntxt(struct irq_data *data)
+{
+   return true;
+}
+static inline bool irq_move_pending(struct irq_data *data)
+{
+   return false;
+}
+static inline void
+irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask)
+{
+}
+static inline void
+irq_get_pending(struct cpumask *mask, struct irq_desc *desc)
+{
+}
+#endif /* CONFIG_GENERIC_PENDING_IRQ */
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5c25d4a5dc46..5fa334e5c046 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -168,34 +168,6 @@ void irq_set_thread_affinity(struct irq_desc *desc)
set_bit(IRQTF_AFFINITY, &action->thread_flags);
 }
 
-#ifdef CONFIG_GENERIC_PENDING_IRQ
-static inline bool irq_can_move_pcntxt(struct irq_data *data)
-{
-   return irqd_can_move_in_process_context(data);
-}
-static inline bool irq_move_pending(struct irq_data *data)
-{
-   return irqd_is_setaffinity_pending(data);
-}
-static inline void
-irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask)
-{
-   cpumask_copy(desc->pending_mask, mask);
-}
-static inline void
-irq_get_pending(struct cpumask *mask, struct irq_desc *desc)
-{
-   cpumask_copy(mask, desc->pending_mask);
-}
-#else
-static inline bool irq_can_move_pcntxt(struct irq_data *data) { return true; }
-static inline bool irq_move_pending(struct irq_data *data) { return false; }
-static inline void
-irq_copy_pending(struct irq_desc *desc, const struct cpumask *mask) { }
-static inline void
-irq_get_pending(struct cpumask *mask, struct irq_desc *desc) { }
-#endif
-
 int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
bool force)
 {
-- 
2.11.0



[PATCH 6/8] blk-mq: include all present CPUs in the default queue mapping

2017-06-03 Thread Christoph Hellwig
This way we get a nice distribution independent of the current cpu
online / offline state.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-cpumap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8e61e8640e17..5eaecd40f701 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -35,7 +35,6 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 {
unsigned int *map = set->mq_map;
unsigned int nr_queues = set->nr_hw_queues;
-   const struct cpumask *online_mask = cpu_online_mask;
unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
cpumask_var_t cpus;
 
@@ -44,7 +43,7 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 
cpumask_clear(cpus);
nr_cpus = nr_uniq_cpus = 0;
-   for_each_cpu(i, online_mask) {
+   for_each_present_cpu(i) {
nr_cpus++;
first_sibling = get_first_sibling(i);
if (!cpumask_test_cpu(first_sibling, cpus))
@@ -54,7 +53,7 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 
queue = 0;
for_each_possible_cpu(i) {
-   if (!cpumask_test_cpu(i, online_mask)) {
+   if (!cpumask_test_cpu(i, cpu_present_mask)) {
map[i] = 0;
continue;
}
-- 
2.11.0



[RFC v2 2/4] tracing: Add support for recording tgid of tasks

2017-06-03 Thread Joel Fernandes
In order to support recording of tgid, the following changes are made:

- Introduce a new API for optionally recording the tgid along with the task's
  comm, which replaces the existing '*cmdline*' APIs.
- reuse the existing sched_switch and sched_wakeup probes
- replace all uses of the old API
- add a new option 'record-tgid' to enable recording of tgid

This will have no memory or runtime overhead if the record-tgid option isn't
enabled.
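
Conceptually the tgid side is a flat pid-indexed table that is only
allocated once the option is enabled; a simplified sketch follows (names
follow the patch, locking and bounds handling are simplified assumptions):

/* Simplified sketch of the pid -> tgid table behind the record-tgid option. */
static unsigned int *tgid_map;  /* PID_MAX_DEFAULT + 1 entries, indexed by pid */

static void record_tgid_sketch(struct task_struct *tsk)
{
        if (tgid_map && tsk->pid <= PID_MAX_DEFAULT)
                tgid_map[tsk->pid] = tsk->tgid;
}

static int find_tgid_sketch(int pid)
{
        if (tgid_map && pid >= 0 && pid <= PID_MAX_DEFAULT)
                return tgid_map[pid];
        return 0;       /* unknown */
}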

Cc: kernel-t...@android.com
Cc: Steven Rostedt 
Cc: Ingo Molnar  
Signed-off-by: Joel Fernandes 
---
 include/linux/trace_events.h | 10 -
 kernel/trace/blktrace.c  |  2 +-
 kernel/trace/trace.c | 79 --
 kernel/trace/trace.h |  9 +++-
 kernel/trace/trace_events.c  | 83 ++--
 kernel/trace/trace_functions.c   |  5 ++-
 kernel/trace/trace_functions_graph.c |  4 +-
 kernel/trace/trace_sched_switch.c| 67 +
 kernel/trace/trace_selftest.c|  2 +-
 9 files changed, 208 insertions(+), 53 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index a556805eff8a..bc54f1469971 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -151,7 +151,11 @@ trace_event_buffer_lock_reserve(struct ring_buffer 
**current_buffer,
int type, unsigned long len,
unsigned long flags, int pc);
 
-void tracing_record_cmdline(struct task_struct *tsk);
+void tracing_record_taskinfo(struct task_struct **tasks, int len, bool cmd,
+bool tgid);
+
+void tracing_record_taskinfo_single(struct task_struct *task, bool cmd,
+   bool tgid);
 
 int trace_output_call(struct trace_iterator *iter, char *name, char *fmt, ...);
 
@@ -290,6 +294,7 @@ struct trace_subsystem_dir;
 enum {
EVENT_FILE_FL_ENABLED_BIT,
EVENT_FILE_FL_RECORDED_CMD_BIT,
+   EVENT_FILE_FL_RECORDED_TGID_BIT,
EVENT_FILE_FL_FILTERED_BIT,
EVENT_FILE_FL_NO_SET_FILTER_BIT,
EVENT_FILE_FL_SOFT_MODE_BIT,
@@ -315,6 +320,7 @@ enum {
 enum {
EVENT_FILE_FL_ENABLED   = (1 << EVENT_FILE_FL_ENABLED_BIT),
EVENT_FILE_FL_RECORDED_CMD  = (1 << EVENT_FILE_FL_RECORDED_CMD_BIT),
+   EVENT_FILE_FL_RECORDED_TGID = (1 << 
EVENT_FILE_FL_RECORDED_TGID_BIT),
EVENT_FILE_FL_FILTERED  = (1 << EVENT_FILE_FL_FILTERED_BIT),
EVENT_FILE_FL_NO_SET_FILTER = (1 << 
EVENT_FILE_FL_NO_SET_FILTER_BIT),
EVENT_FILE_FL_SOFT_MODE = (1 << EVENT_FILE_FL_SOFT_MODE_BIT),
@@ -463,7 +469,7 @@ int trace_set_clr_event(const char *system, const char 
*event, int set);
 #define event_trace_printk(ip, fmt, args...)   \
 do {   \
__trace_printk_check_format(fmt, ##args);   \
-   tracing_record_cmdline(current);\
+   tracing_record_taskinfo_single(current, true, false);   \
if (__builtin_constant_p(fmt)) {\
static const char *trace_printk_fmt \
  __attribute__((section("__trace_printk_fmt"))) =  \
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 193c5f5e3f79..d7394cdf899e 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -236,7 +236,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t 
sector, int bytes,
cpu = raw_smp_processor_id();
 
if (blk_tracer) {
-   tracing_record_cmdline(current);
+   tracing_record_taskinfo_single(current, true, false);
 
buffer = blk_tr->trace_buffer.buffer;
pc = preempt_count();
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 63deff9cdf2c..7be21ae4f0a8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -87,7 +87,7 @@ dummy_set_flag(struct trace_array *tr, u32 old_flags, u32 
bit, int set)
  * tracing is active, only save the comm when a trace event
  * occurred.
  */
-static DEFINE_PER_CPU(bool, trace_cmdline_save);
+static DEFINE_PER_CPU(bool, trace_taskinfo_save);
 
 /*
  * Kill all tracing for good (never come back).
@@ -790,7 +790,7 @@ EXPORT_SYMBOL_GPL(tracing_on);
 static __always_inline void
 __buffer_unlock_commit(struct ring_buffer *buffer, struct ring_buffer_event 
*event)
 {
-   __this_cpu_write(trace_cmdline_save, true);
+   __this_cpu_write(trace_taskinfo_save, true);
 
/* If this is the temp buffer, we need to commit fully */
if (this_cpu_read(trace_buffered_event) == event) {
@@ -1709,6 +1709,15 @@ void tracing_reset_all_online_cpus(void)
}
 }
 
+static unsigned int *tgid_map;
+
+void tracing_alloc_tgid_map(void)
+{
+   tgid_map = kzalloc((PID_MAX_DEFAULT + 1) * siz

Re: [RFC v2 2/4] tracing: Add support for recording tgid of tasks

2017-06-03 Thread Joel Fernandes
Some minor things that I will rework in the next rev after spending some
more time on it:

On Sat, Jun 3, 2017 at 9:03 PM, Joel Fernandes  wrote:
[..]
> @@ -463,7 +469,7 @@ int trace_set_clr_event(const char *system, const char 
> *event, int set);
>  #define event_trace_printk(ip, fmt, args...)   \
>  do {   \
> __trace_printk_check_format(fmt, ##args);   \
> -   tracing_record_cmdline(current);\
> +   tracing_record_taskinfo_single(current, true, false);   \
> if (__builtin_constant_p(fmt)) {\
> static const char *trace_printk_fmt \
>   __attribute__((section("__trace_printk_fmt"))) =  \
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 193c5f5e3f79..d7394cdf899e 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -236,7 +236,7 @@ static void __blk_add_trace(struct blk_trace *bt, 
> sector_t sector, int bytes,
> cpu = raw_smp_processor_id();
>
> if (blk_tracer) {
> -   tracing_record_cmdline(current);
> +   tracing_record_taskinfo_single(current, true, false);

I think I will try to preserve the existing API so that existing users
aren't bothered much.

>
> buffer = blk_tr->trace_buffer.buffer;
> pc = preempt_count();
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 63deff9cdf2c..7be21ae4f0a8 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -87,7 +87,7 @@ dummy_set_flag(struct trace_array *tr, u32 old_flags, u32 
> bit, int set)
>   * tracing is active, only save the comm when a trace event
>   * occurred.
>   */
> -static DEFINE_PER_CPU(bool, trace_cmdline_save);
> +static DEFINE_PER_CPU(bool, trace_taskinfo_save);
>
>  /*
>   * Kill all tracing for good (never come back).
> @@ -790,7 +790,7 @@ EXPORT_SYMBOL_GPL(tracing_on);
>  static __always_inline void
>  __buffer_unlock_commit(struct ring_buffer *buffer, struct ring_buffer_event 
> *event)
>  {
> -   __this_cpu_write(trace_cmdline_save, true);
> +   __this_cpu_write(trace_taskinfo_save, true);
>
> /* If this is the temp buffer, we need to commit fully */
> if (this_cpu_read(trace_buffered_event) == event) {
> @@ -1709,6 +1709,15 @@ void tracing_reset_all_online_cpus(void)
> }
>  }
>
> +static unsigned int *tgid_map;
> +
> +void tracing_alloc_tgid_map(void)
> +{
> +   tgid_map = kzalloc((PID_MAX_DEFAULT + 1) * sizeof(*tgid_map),
> +  GFP_KERNEL);
> +   WARN_ONCE(!tgid_map, "Allocation of tgid_map failed\n");

I should check whether tgid_map is already allocated, otherwise there's a
chance of re-allocating it.
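
One possible guard, sketched here only as a suggestion rather than the
final code:

void tracing_alloc_tgid_map(void)
{
        if (tgid_map)   /* already allocated, e.g. by another caller */
                return;

        tgid_map = kzalloc((PID_MAX_DEFAULT + 1) * sizeof(*tgid_map),
                           GFP_KERNEL);
        WARN_ONCE(!tgid_map, "Allocation of tgid_map failed\n");
}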

Looking forward to any other comments...

thanks,

-Joel