Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Dan Williams
On Wed, Mar 14, 2018 at 12:34 PM, Stephen  Bates  wrote:
>> P2P over PCI/PCI-X is quite common in devices like raid controllers.
>
> Hi Dan
>
> Do you mean between PCIe devices below the RAID controller? Isn't it pretty 
> novel to be able to support PCIe EPs below a RAID controller (as opposed to 
> SCSI based devices)?

I'm thinking of the classic I/O offload card where there's an NTB to
an internal PCI bus that has a storage controller and raid offload
engines.


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Martin K. Petersen

Stephen,

>> It would be useful if those configurations were not left behind so
>> that Linux could feasibly deploy offload code to a controller in the
>> PCI domain.
>
> Agreed. I think this would be great. Kind of like the XCOPY framework
> that was proposed a while back for SCSI devices [1] but updated to also
> include NVMe devices. That is definitely a use case we would like this
> framework to support.

I'm on my umpteenth rewrite of the block/SCSI offload code. Facing
downwards from the block layer, it is not as protocol-agnostic as I would
like. It has proven quite hard to reconcile token-based and EXTENDED COPY
semantics along with the desire to support stacking. But from an
application/filesystem perspective everything looks the same regardless
of the intricacies of the device. Nothing is preventing us from
supporting other protocols...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V5 0/5] SCSI: fix selection of reply(hw) queue

2018-03-14 Thread Martin K. Petersen

Ming,

> The patches fix reply queue (virt-queue on virtio-scsi) selection on
> hpsa, megaraid_sas and virtio-scsi; IO hangs can easily be caused by
> this issue.

I clarified all the commit descriptions. There were also a bunch of
duplicate review tags and other warnings. Please run checkpatch next
time!

Applied to 4.16/scsi-fixes. Thank you.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v5] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-14 Thread Joseph Qi
Hello Tejun,

Thanks for your quick response.

On 18/3/14 22:09, Tejun Heo wrote:
> Hello,
> 
> On Wed, Mar 14, 2018 at 02:18:04PM +0800, Joseph Qi wrote:
>> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
>> blkcg_bio_issue_check()")
>> Reported-by: Jiufei Xue 
>> Cc: sta...@vger.kernel.org #4.3+
> 
> I'm a bit nervous about tagging it for -stable.  Given the low rate of
> this actually occurring, I'm not sure the benefits outweigh the risks.
> Let's at least cook it for a couple releases before sending it to
> -stable.
> 
>> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
>> index 69bea82..dccd102 100644
>> --- a/include/linux/blk-cgroup.h
>> +++ b/include/linux/blk-cgroup.h
>> @@ -88,6 +88,7 @@ struct blkg_policy_data {
>>  /* the blkg and policy id this per-policy data belongs to */
>>  struct blkcg_gq *blkg;
>>  int plid;
>> +	bool			offlined;
>>  };
> 
> This is pure bike-shedding but offlined reads kinda weird to me, maybe
> just offline would read better?  Other than that,
> 
Do I need to resend a new version for this?

Thanks,
Joseph

>  Acked-by: Tejun Heo 
> 
> Thanks a lot for seeing this through.
> 


[PATCH v3, resend] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>

2018-03-14 Thread Bart Van Assche
It happens often while I'm preparing a patch for a block driver that
I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
available for this driver? Do I have to introduce definitions of these
constants before I can use them? To avoid this confusion, move the
existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
<linux/blkdev.h> header file such that they become available to all
block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
header file conditional so that including that header file after
<linux/blkdev.h> does not cause the compiler to complain about a
SECTOR_SIZE redefinition.

Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
not been removed from uapi header files nor from NAND drivers in
which these constants are used for another purpose than converting
block layer offsets and sizes into a number of sectors.
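
For reference, the consolidated definitions end up roughly as follows (a
sketch based on the changelog below; the exact layout in the patch may
differ):

  /* include/linux/blkdev.h: one shared definition for all block drivers */
  #define SECTOR_SHIFT 9
  #define SECTOR_SIZE (1 << SECTOR_SHIFT)

  /* include/uapi/linux/msdos_fs.h: now conditional, so that including it
   * after <linux/blkdev.h> does not trigger a redefinition warning */
  #ifndef SECTOR_SIZE
  #define SECTOR_SIZE 512
  #endif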

Signed-off-by: Bart Van Assche 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Martin K. Petersen 
Cc: Sergey Senozhatsky 
Cc: David S. Miller 
Cc: Mike Snitzer 
Cc: Dan Williams 
Cc: Minchan Kim 
Cc: Nitin Gupta 
---

Changes compared to v2:
- Updated Reviewed-by tags.

Changes compared to v1:
- Changed enums into defines.
- Defined SECTOR_SIZE in terms of SECTOR_SHIFT.
- Made uapi SECTOR_SIZE definition conditional.

 arch/xtensa/platforms/iss/simdisk.c |  1 -
 drivers/block/brd.c |  1 -
 drivers/block/null_blk.c|  2 --
 drivers/block/rbd.c |  9 
 drivers/block/zram/zram_drv.h   |  1 -
 drivers/ide/ide-cd.c|  8 +++
 drivers/ide/ide-cd.h|  6 +-
 drivers/nvdimm/nd.h |  1 -
 drivers/scsi/gdth.h |  3 ---
 include/linux/blkdev.h  | 42 +++--
 include/linux/device-mapper.h   |  2 --
 include/linux/ide.h |  1 -
 include/uapi/linux/msdos_fs.h   |  2 ++
 13 files changed, 38 insertions(+), 41 deletions(-)

diff --git a/arch/xtensa/platforms/iss/simdisk.c 
b/arch/xtensa/platforms/iss/simdisk.c
index 1b6418407467..026211e7ab09 100644
--- a/arch/xtensa/platforms/iss/simdisk.c
+++ b/arch/xtensa/platforms/iss/simdisk.c
@@ -21,7 +21,6 @@
 #include 
 
 #define SIMDISK_MAJOR 240
-#define SECTOR_SHIFT 9
 #define SIMDISK_MINORS 1
 #define MAX_SIMDISK_COUNT 10
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index deea78e485da..66cb0f857f64 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -24,7 +24,6 @@
 
 #include 
 
-#define SECTOR_SHIFT   9
 #define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
 #define PAGE_SECTORS   (1 << PAGE_SECTORS_SHIFT)
 
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 0517613afccb..a76553293a31 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -16,10 +16,8 @@
 #include 
 #include 
 
-#define SECTOR_SHIFT   9
 #define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
 #define PAGE_SECTORS   (1 << PAGE_SECTORS_SHIFT)
-#define SECTOR_SIZE(1 << SECTOR_SHIFT)
 #define SECTOR_MASK(PAGE_SECTORS - 1)
 
 #define FREE_BATCH 16
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0016170cde0a..1e03b04819c8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -50,15 +50,6 @@
 
 #define RBD_DEBUG  /* Activate rbd_assert() calls */
 
-/*
- * The basic unit of block I/O is a sector.  It is interpreted in a
- * number of contexts in Linux (blk, bio, genhd), but the default is
- * universally 512 bytes.  These symbols are just slightly more
- * meaningful than the bare numbers they represent.
- */
-#defineSECTOR_SHIFT9
-#defineSECTOR_SIZE (1ULL << SECTOR_SHIFT)
-
 /*
  * Increment the given counter and return its updated value.
  * If the counter is already 0 it will not be incremented.
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 31762db861e3..1e9bf65c0bfb 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -37,7 +37,6 @@ static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;
 
 /*-- End of configurable params */
 
-#define SECTOR_SHIFT   9
 #define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
 #define SECTORS_PER_PAGE   (1 << SECTORS_PER_PAGE_SHIFT)
 #define ZRAM_LOGICAL_BLOCK_SHIFT 12
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 5613cc2d51fc..5a8e8e3c22cd 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -712,7 +712,7 @@ static ide_startstop_t cdrom_start_rw(ide_drive_t *drive, 
struct request *rq)
struct request_queue *q = drive->queue;
int write = rq_data_dir(rq) == WRITE;
unsigned short sectors_per_frame =
-   queue_logical_block_size(q) >> SECTOR_BITS;
+   queue_logical_block_size(q) >> SECTOR_SHIFT;
 
ide_debug_log(IDE_DBG_RQ, "rq->cmd[0]: 0x%x, rq->cmd_flags: 0x%x, "
  "secs_per_fra

Re: dm mpath: fix passing integrity data

2018-03-14 Thread Mike Snitzer
On Wed, Mar 14 2018 at 10:33am -0400,
Steffen Maier  wrote:

> After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity
> data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
> block integrity any more. So add it to the whitelist.
> 
> This is also a pre-requisite to use block integrity with other dm layer(s)
> on top of multipath, such as kpartx partitions (dm-linear) or LVM.
> 
> Signed-off-by: Steffen Maier 
> Bisected-by: Fedor Loshakov 
> Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data")
> Cc:  #4.12+
> ---
>  drivers/md/dm-mpath.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 3fde9e9faddd..c174f0c53dc9 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -2023,7 +2023,8 @@ static int multipath_busy(struct dm_target *ti)
>  static struct target_type multipath_target = {
>   .name = "multipath",
>   .version = {1, 12, 0},
> - .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
> + .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE |
> + DM_TARGET_PASSES_INTEGRITY,
>   .module = THIS_MODULE,
>   .ctr = multipath_ctr,
>   .dtr = multipath_dtr,

Thanks, I've queued this for 4.16-rc6, will send to Linus tomorrow.


Re: [PATCH 8/8] block: sed-opal: ioctl for writing to shadow mbr

2018-03-14 Thread kbuild test robot
Hi Jonas,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20180309]
[cannot apply to linus/master v4.16-rc4 v4.16-rc3 v4.16-rc2 v4.16-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Jonas-Rabenstein/block-sed-opal-support-write-to-shadow-mbr/20180314-184749
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   block/sed-opal.c:381:20: sparse: incorrect type in assignment (different base types)
   block/sed-opal.c:381:20:    expected unsigned long long [unsigned] [usertype] align
   block/sed-opal.c:381:20:    got restricted __be64 const [usertype] alignment_granularity
   block/sed-opal.c:382:25: sparse: incorrect type in assignment (different base types)
   block/sed-opal.c:382:25:    expected unsigned long long [unsigned] [usertype] lowest_lba
   block/sed-opal.c:382:25:    got restricted __be64 const [usertype] lowest_aligned_lba
>> block/sed-opal.c:1526:58: sparse: incorrect type in argument 2 (different address spaces)
   block/sed-opal.c:1526:58:    expected void const [noderef] *from
   block/sed-opal.c:1526:58:    got unsigned char const [usertype] *
>> block/sed-opal.c:2100:14: sparse: incorrect type in argument 1 (different address spaces)
   block/sed-opal.c:2100:14:    expected void const volatile [noderef] *
   block/sed-opal.c:2100:14:    got unsigned char const [usertype] *data

vim +1526 block/sed-opal.c

  1493  
  1494  static int write_shadow_mbr(struct opal_dev *dev, void *data)
  1495  {
  1496  struct opal_shadow_mbr *shadow = data;
  1497  size_t off;
  1498  u64 len;
  1499  int err = 0;
  1500  u8 *payload;
  1501  
  1502  	/* FIXME: this is the maximum we can use for IO_BUFFER_LENGTH=2048.
  1503  	 *        Instead of having a constant, it would be nice to compute the
  1504  	 *        actual value depending on IO_BUFFER_LENGTH
  1505  	 */
  1506  len = 1950;
  1507  
  1508  /* do the actual transmission(s) */
  1509  for (off = 0 ; off < shadow->size; off += len) {
  1510  len = min(len, shadow->size - off);
  1511  
  1512  pr_debug("MBR: write bytes %zu+%llu/%llu\n",
  1513   off, len, shadow->size);
  1514  err = start_opal_cmd(dev, opaluid[OPAL_MBR],
  1515   opalmethod[OPAL_SET]);
  1516  add_token_u8(&err, dev, OPAL_STARTNAME);
  1517  add_token_u8(&err, dev, OPAL_WHERE);
  1518  add_token_u64(&err, dev, shadow->offset + off);
  1519  add_token_u8(&err, dev, OPAL_ENDNAME);
  1520  
  1521  add_token_u8(&err, dev, OPAL_STARTNAME);
  1522  add_token_u8(&err, dev, OPAL_VALUES);
  1523  payload = add_bytestring_header(&err, dev, len);
  1524  if (!payload)
  1525  break;
> 1526  if (copy_from_user(payload, shadow->data + off, len))
  1527  err = -EFAULT;
  1528  
  1529  add_token_u8(&err, dev, OPAL_ENDNAME);
  1530  if (err)
  1531  break;
  1532  
  1533  err = finalize_and_send(dev, parse_and_check_status);
  1534  if (err)
  1535  break;
  1536  }
  1537  return err;
  1538  }
  1539  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Stephen Bates
> P2P over PCI/PCI-X is quite common in devices like raid controllers.

Hi Dan 

Do you mean between PCIe devices below the RAID controller? Isn't it pretty 
novel to be able to support PCIe EPs below a RAID controller (as opposed to 
SCSI based devices)?

> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.
   
Agreed. I think this would be great. Kind of like the XCOPY framework that was 
proposed a while back for SCSI devices [1] but updated to also include NVMe 
devices. That is definitely a use case we would like this framework to support.

Stephen
 
[1] https://lwn.net/Articles/592094/



Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Logan Gunthorpe


On 14/03/18 01:28 PM, Dan Williams wrote:
> P2P over PCI/PCI-X is quite common in devices like raid controllers.
> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.

Thanks for the note. Neat. In the end, nothing is getting left behind;
it's just work for someone to add support. Even if I weren't already
going to make the change I mentioned, it all fits into the architecture
and APIs quite easily.

Logan



Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Dan Williams
On Wed, Mar 14, 2018 at 12:03 PM, Logan Gunthorpe  wrote:
>
>
> On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
>> You are focused on PCIe systems, and in those systems, most topologies
>> do have an upstream switch, which means two upstream bridges.  I'm
>> trying to remove that assumption because I don't think there's a
>> requirement for it in the spec.  Enforcing this assumption complicates
>> the code and makes it harder to understand because the reader says
>> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
>> so why do we need these two bridges?"
>
> Yes, as I've said, we focused on being behind a single PCIe Switch
> because it's easier and vaguely safer (we *know* switches will work but
> other types of topology we have to assume will work based on the spec).
> Also, I have my doubts that anyone will ever have a use for this with
> non-PCIe devices.

P2P over PCI/PCI-X is quite common in devices like raid controllers.
It would be useful if those configurations were not left behind so
that Linux could feasibly deploy offload code to a controller in the
PCI domain.


Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev

2018-03-14 Thread Michael Lyle
On 02/27/2018 08:55 AM, Coly Li wrote:
> If a bcache device is configured in writeback mode, the current code does
> not handle write I/O errors on backing devices properly.
> 
> In writeback mode, a write request is written to the cache device and
> later flushed to the backing device. If I/O fails when writing from the
> cache device to the backing device, bcache just ignores the error and
> the upper layer is NOT notified that the backing device is broken.

lgtm, applied


Re: [PATCH v7 7/9] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O

2018-03-14 Thread Michael Lyle
LGTM, applied (sorry if this is duplicated, had mail client problems)

On 02/27/2018 08:55 AM, Coly Li wrote:
> In order to catch I/O errors of the backing device, a separate bi_end_io
> callback is required. Then a per-backing-device counter can record the
> number of I/O errors and retire the backing device if the counter reaches
> a per-backing-device I/O error limit.
> 
> This patch adds backing_request_endio() to the bcache backing device I/O
> code path, as preparation for more complicated backing device failure
> handling. So far there is no real logic change; I made this a separate
> patch to make sure it is stable and reliable for further work.
> 
> Changelog:
> v2: Fix code comment typos, remove a redundant bch_writeback_add() line
> added in the v4 patch set.
> v1: this is newly added in this patch set.
> 
> Signed-off-by: Coly Li 
> Reviewed-by: Hannes Reinecke 
> Cc: Junhui Tang 
> Cc: Michael Lyle 
> ---
>  drivers/md/bcache/request.c   | 93 
> +++
>  drivers/md/bcache/super.c |  1 +
>  drivers/md/bcache/writeback.c |  1 +
>  3 files changed, 79 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 279c9266bf50..0c517dd806a5 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl)
>   }
>  
>   op->insert_data_done = true;
> + /* get in bch_data_insert() */
>   bio_put(bio);
>  out:
>   continue_at(cl, bch_data_insert_keys, op->wq);
> @@ -630,6 +631,38 @@ static void request_endio(struct bio *bio)
>   closure_put(cl);
>  }
>  
> +static void backing_request_endio(struct bio *bio)
> +{
> + struct closure *cl = bio->bi_private;
> +
> + if (bio->bi_status) {
> + struct search *s = container_of(cl, struct search, cl);
> + /*
> +  * If a bio has REQ_PREFLUSH for writeback mode, it is
> +  * specially assembled in cached_dev_write() for a non-zero
> +  * write request which has REQ_PREFLUSH. We don't set
> +  * s->iop.status for this failure; the status will be decided
> +  * by the result of the bch_data_insert() operation.
> +  */
> + if (unlikely(s->iop.writeback &&
> +  bio->bi_opf & REQ_PREFLUSH)) {
> + char buf[BDEVNAME_SIZE];
> +
> + bio_devname(bio, buf);
> + pr_err("Can't flush %s: returned bi_status %i",
> + buf, bio->bi_status);
> + } else {
> + /* set to orig_bio->bi_status in bio_complete() */
> + s->iop.status = bio->bi_status;
> + }
> + s->recoverable = false;
> + /* should count I/O error for backing device here */
> + }
> +
> + bio_put(bio);
> + closure_put(cl);
> +}
> +
>  static void bio_complete(struct search *s)
>  {
>   if (s->orig_bio) {
> @@ -644,13 +677,21 @@ static void bio_complete(struct search *s)
>   }
>  }
>  
> -static void do_bio_hook(struct search *s, struct bio *orig_bio)
> +static void do_bio_hook(struct search *s,
> + struct bio *orig_bio,
> + bio_end_io_t *end_io_fn)
>  {
>   struct bio *bio = &s->bio.bio;
>  
>   bio_init(bio, NULL, 0);
>   __bio_clone_fast(bio, orig_bio);
> - bio->bi_end_io  = request_endio;
> + /*
> +  * bi_end_io can be set separately somewhere else, e.g. the
> +  * variants in,
> +  * - cache_bio->bi_end_io from cached_dev_cache_miss()
> +  * - n->bi_end_io from cache_lookup_fn()
> +  */
> + bio->bi_end_io  = end_io_fn;
>   bio->bi_private = &s->cl;
>  
>   bio_cnt_set(bio, 3);
> @@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio,
>   s = mempool_alloc(d->c->search, GFP_NOIO);
>  
>   closure_init(&s->cl, NULL);
> - do_bio_hook(s, bio);
> + do_bio_hook(s, bio, request_endio);
>  
>   s->orig_bio = bio;
>   s->cache_miss   = NULL;
> @@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl)
>   trace_bcache_read_retry(s->orig_bio);
>  
>   s->iop.status = 0;
> - do_bio_hook(s, s->orig_bio);
> + do_bio_hook(s, s->orig_bio, backing_request_endio);
>  
>   /* XXX: invalidate cache */
>  
> + /* I/O request sent to backing device */
>   closure_bio_submit(s->iop.c, bio, cl);
>   }
>  
> @@ -859,7 +901,7 @@ static int cached_dev_cache_miss(struct btree *b, struct 
> search *s,
>   bio_copy_dev(cache_bio, miss);
>   cache_bio->bi_iter.bi_size  = s->insert_bio_sectors << 9;
>  
> - cache_bio->bi_end_io= request_endio;
> + cache_bio->bi_end_io= backing_request_endio;
>   cache_

Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Logan Gunthorpe


On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
> You are focused on PCIe systems, and in those systems, most topologies
> do have an upstream switch, which means two upstream bridges.  I'm
> trying to remove that assumption because I don't think there's a
> requirement for it in the spec.  Enforcing this assumption complicates
> the code and makes it harder to understand because the reader says
> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
> so why do we need these two bridges?"

Yes, as I've said, we focused on being behind a single PCIe Switch
because it's easier and vaguely safer (we *know* switches will work but
other types of topology we have to assume will work based on the spec).
Also, I have my doubts that anyone will ever have a use for this with
non-PCIe devices.

A switch shows up as two or more virtual bridges (per the PCIe v4 Spec
1.3.3) which explains the existing get_upstream_bridge_port() function.

In any case, we'll look at generalizing this by looking for a common
upstream port in the next revision of the patch set.

Logan




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Bjorn Helgaas
On Wed, Mar 14, 2018 at 10:17:34AM -0600, Logan Gunthorpe wrote:
> On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> > I agree that peers need to have a common upstream bridge.  I think
> > you're saying peers need to have *two* common upstream bridges.  If I
> > understand correctly, requiring two common bridges is a way to ensure
> > that peers directly below Root Ports don't try to DMA to each other.
> 
> No, I don't get where you think we need to have two common upstream
> bridges. I'm not sure when such a case would ever happen. But you seem
> to understand based on what you wrote below.

Sorry, I phrased that wrong.  You don't require two common upstream
bridges; you require two upstream bridges, with the upper one being
common, i.e.,

  static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
  {
struct pci_dev *up1, *up2;

up1 = pci_dev_get(pci_upstream_bridge(pdev));
up2 = pci_dev_get(pci_upstream_bridge(up1));
return up2;
  }

So if you're starting with pdev, up1 is the immediately upstream
bridge and up2 is the second upstream bridge.  If this is PCIe, up1
may be a Root Port and there is no up2, or up1 and up2 are in a
switch.

This is more restrictive than the spec requires.  As long as there is
a single common upstream bridge, peer-to-peer DMA should work.  In
fact, in conventional PCI, I think the upstream bridge could even be
the host bridge (not a PCI-to-PCI bridge).

You are focused on PCIe systems, and in those systems, most topologies
do have an upstream switch, which means two upstream bridges.  I'm
trying to remove that assumption because I don't think there's a
requirement for it in the spec.  Enforcing this assumption complicates
the code and makes it harder to understand because the reader says
"huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
so why do we need these two bridges?"

[*] For conventional PCI, this means anything below the same host
bridge.  Two devices on a conventional PCI root bus should be able to
DMA to each other, even though there's no PCI-to-PCI bridge above
them.  For PCIe, it means a "hierarchy domain" as used in PCIe r4.0,
sec 1.3.1, i.e., anything below the same Root Port.

> > So I guess the first order of business is to nail down whether peers
> > below a Root Port are prohibited from DMAing to each other.  My
> > assumption, based on 6.12.1.2 and the fact that I haven't yet found
> > a prohibition, is that they can.
> 
> If you have a multifunction device designed to DMA to itself below a
> root port, it can. But determining this is on a device by device basis,
> just as determining whether a root complex can do peer to peer is on a
> per device basis. So I'd say we don't want to allow it by default and
> let someone who has such a device figure out what's necessary if and
> when one comes along.

It's not the job of this infrastructure to answer the device-dependent
question of whether DMA initiators or targets support peer-to-peer
DMA.

All we want to do here is figure out whether the PCI topology supports
it, using the mechanisms guaranteed by the spec.  We can derive that
from the basic rules about how PCI bridges work, i.e., from the
PCI-to-PCI Bridge spec r1.2, sec 4.3:

  A bridge forwards PCI memory transactions from its primary interface
  to its secondary interface (downstream) if a memory address is in
  the range defined by the Memory Base and Memory Limit registers
  (when the base is less than or equal to the limit) as illustrated in
  Figure 4-3. Conversely, a memory transaction on the secondary
  interface that is within this address range will not be forwarded
  upstream to the primary interface. Any memory transactions on the
  secondary interface that are outside this address range will be
  forwarded upstream to the primary interface (provided they are not
  in the address range defined by the prefetchable memory address
  range registers).
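
Expressed as a predicate, the quoted forwarding rule amounts to the
following (illustrative only, hypothetical helper, not kernel code):

  static bool bridge_forwards_downstream(u64 addr, u64 mem_base, u64 mem_limit)
  {
          /* A bridge claims a memory transaction on its primary side and
           * forwards it downstream when the address falls inside the
           * window programmed into its Memory Base/Limit registers. */
          return mem_base <= mem_limit &&
                 addr >= mem_base && addr <= mem_limit;
  }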

This works for either PCI or PCIe.  The only wrinkle PCIe adds is that
the very top of the hierarchy is a Root Port, and we can't rely on it
to route traffic to other Root Ports.  I also doubt Root Complex
Integrated Endpoints can participate in peer-to-peer DMA.

Thanks for your patience in working through all this.  I know it
sometimes feels like being bounced around in all directions.  It's
just a normal consequence of trying to add complex functionality to an
already complex system, with interest and expertise spread unevenly
across a crowd of people.

Bjorn


Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev

2018-03-14 Thread Michael Lyle
LGTM, applying.

On 02/27/2018 08:55 AM, Coly Li wrote:
> If a bcache device is configured in writeback mode, the current code does
> not handle write I/O errors on backing devices properly.
> 
> In writeback mode, a write request is written to the cache device and
> later flushed to the backing device. If I/O fails when writing from the
> cache device to the backing device, bcache just ignores the error and
> the upper layer is NOT notified that the backing device is broken.
> 
> This patch tries to handle backing device failure the same way cache
> device failure is handled:
> - Add an error counter 'io_errors' and an error limit 'error_limit' in
>   struct cached_dev. Add another io_disable to struct cached_dev to
>   disable I/Os on the problematic backing device.
> - When an I/O error happens on the backing device, increase the io_errors
>   counter. If io_errors reaches error_limit, set cached_dev->io_disable
>   to true, and stop the bcache device.
> 
> The result is, if the backing device is broken or disconnected, and I/O
> errors reach its error limit, the backing device will be disabled and the
> associated bcache device will be removed from the system.
> 
> Changelog:
> v2: remove the "bcache: " prefix in pr_error(), and use the correct name
> string to print out the bcache device gendisk name.
> v1: this is newly added in the v2 patch set.
> 
> Signed-off-by: Coly Li 
> Reviewed-by: Hannes Reinecke 
> Cc: Michael Lyle 
> Cc: Junhui Tang 
> ---
>  drivers/md/bcache/bcache.h  |  6 ++
>  drivers/md/bcache/io.c  | 14 ++
>  drivers/md/bcache/request.c | 14 --
>  drivers/md/bcache/super.c   | 21 +
>  drivers/md/bcache/sysfs.c   | 15 ++-
>  5 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 5e9f3610c6fd..d338b7086013 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -367,6 +367,7 @@ struct cached_dev {
>  	unsigned		sequential_cutoff;
>  	unsigned		readahead;
>  
> +	unsigned		io_disable:1;
>  	unsigned		verify:1;
>  	unsigned		bypass_torture_test:1;
> 
> @@ -388,6 +389,9 @@ struct cached_dev {
>  	unsigned		writeback_rate_minimum;
>  
>  	enum stop_on_failure	stop_when_cache_set_failed;
> +#define DEFAULT_CACHED_DEV_ERROR_LIMIT	64
> +	atomic_t		io_errors;
> +	unsigned		error_limit;
>  };
>  
>  enum alloc_reserve {
> @@ -911,6 +915,7 @@ static inline void wait_for_kthread_stop(void)
>  
>  /* Forward declarations */
>  
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio);
>  void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
>  void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
> blk_status_t, const char *);
> @@ -938,6 +943,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned,
>struct bkey *, int, bool);
>  bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned,
>  unsigned, unsigned, bool);
> +bool bch_cached_dev_error(struct cached_dev *dc);
>  
>  __printf(2, 3)
>  bool bch_cache_set_error(struct cache_set *, const char *, ...);
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index 8013ecbcdbda..7fac97ae036e 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
>  }
>  
>  /* IO errors */
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
> +{
> + char buf[BDEVNAME_SIZE];
> + unsigned errors;
> +
> + WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
> +
> + errors = atomic_add_return(1, &dc->io_errors);
> + if (errors < dc->error_limit)
> + pr_err("%s: IO error on backing device, unrecoverable",
> + bio_devname(bio, buf));
> + else
> + bch_cached_dev_error(dc);
> +}
>  
>  void bch_count_io_errors(struct cache *ca,
>blk_status_t error,
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 0c517dd806a5..d7a463e0250e 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio)
>  
>   if (bio->bi_status) {
>   struct search *s = container_of(cl, struct search, cl);
> + struct cached_dev *dc = container_of(s->d,
> +  struct cached_dev, disk);
>   /*
>* If a bio has REQ_PREFLUSH for writeback mode, it is
>* specially assembled in cached_dev_write() for a non-zero
> @@ -657,6 +659,7 @@ static void backing_request_endio(struct bio *bio)
>   }
>   s->recoverable = false;
>   

Re: [PATCH v7 7/9] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O

2018-03-14 Thread Michael Lyle
LGTM, applying

On 02/27/2018 08:55 AM, Coly Li wrote:
> In order to catch I/O errors of the backing device, a separate bi_end_io
> callback is required. Then a per-backing-device counter can record the
> number of I/O errors and retire the backing device if the counter reaches
> a per-backing-device I/O error limit.
> 
> This patch adds backing_request_endio() to the bcache backing device I/O
> code path, as preparation for more complicated backing device failure
> handling. So far there is no real logic change; I made this a separate
> patch to make sure it is stable and reliable for further work.
> 
> Changelog:
> v2: Fix code comment typos, remove a redundant bch_writeback_add() line
> added in the v4 patch set.
> v1: this is newly added in this patch set.
> 
> Signed-off-by: Coly Li 
> Reviewed-by: Hannes Reinecke 
> Cc: Junhui Tang 
> Cc: Michael Lyle 
> ---
>  drivers/md/bcache/request.c   | 93 
> +++
>  drivers/md/bcache/super.c |  1 +
>  drivers/md/bcache/writeback.c |  1 +
>  3 files changed, 79 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 279c9266bf50..0c517dd806a5 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl)
>   }
>  
>   op->insert_data_done = true;
> + /* get in bch_data_insert() */
>   bio_put(bio);
>  out:
>   continue_at(cl, bch_data_insert_keys, op->wq);
> @@ -630,6 +631,38 @@ static void request_endio(struct bio *bio)
>   closure_put(cl);
>  }
>  
> +static void backing_request_endio(struct bio *bio)
> +{
> + struct closure *cl = bio->bi_private;
> +
> + if (bio->bi_status) {
> + struct search *s = container_of(cl, struct search, cl);
> + /*
> +  * If a bio has REQ_PREFLUSH for writeback mode, it is
> +  * specially assembled in cached_dev_write() for a non-zero
> +  * write request which has REQ_PREFLUSH. We don't set
> +  * s->iop.status for this failure; the status will be decided
> +  * by the result of the bch_data_insert() operation.
> +  */
> + if (unlikely(s->iop.writeback &&
> +  bio->bi_opf & REQ_PREFLUSH)) {
> + char buf[BDEVNAME_SIZE];
> +
> + bio_devname(bio, buf);
> + pr_err("Can't flush %s: returned bi_status %i",
> + buf, bio->bi_status);
> + } else {
> + /* set to orig_bio->bi_status in bio_complete() */
> + s->iop.status = bio->bi_status;
> + }
> + s->recoverable = false;
> + /* should count I/O error for backing device here */
> + }
> +
> + bio_put(bio);
> + closure_put(cl);
> +}
> +
>  static void bio_complete(struct search *s)
>  {
>   if (s->orig_bio) {
> @@ -644,13 +677,21 @@ static void bio_complete(struct search *s)
>   }
>  }
>  
> -static void do_bio_hook(struct search *s, struct bio *orig_bio)
> +static void do_bio_hook(struct search *s,
> + struct bio *orig_bio,
> + bio_end_io_t *end_io_fn)
>  {
>   struct bio *bio = &s->bio.bio;
>  
>   bio_init(bio, NULL, 0);
>   __bio_clone_fast(bio, orig_bio);
> - bio->bi_end_io  = request_endio;
> + /*
> +  * bi_end_io can be set separately somewhere else, e.g. the
> +  * variants in,
> +  * - cache_bio->bi_end_io from cached_dev_cache_miss()
> +  * - n->bi_end_io from cache_lookup_fn()
> +  */
> + bio->bi_end_io  = end_io_fn;
>   bio->bi_private = &s->cl;
>  
>   bio_cnt_set(bio, 3);
> @@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio,
>   s = mempool_alloc(d->c->search, GFP_NOIO);
>  
>   closure_init(&s->cl, NULL);
> - do_bio_hook(s, bio);
> + do_bio_hook(s, bio, request_endio);
>  
>   s->orig_bio = bio;
>   s->cache_miss   = NULL;
> @@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl)
>   trace_bcache_read_retry(s->orig_bio);
>  
>   s->iop.status = 0;
> - do_bio_hook(s, s->orig_bio);
> + do_bio_hook(s, s->orig_bio, backing_request_endio);
>  
>   /* XXX: invalidate cache */
>  
> + /* I/O request sent to backing device */
>   closure_bio_submit(s->iop.c, bio, cl);
>   }
>  
> @@ -859,7 +901,7 @@ static int cached_dev_cache_miss(struct btree *b, struct 
> search *s,
>   bio_copy_dev(cache_bio, miss);
>   cache_bio->bi_iter.bi_size  = s->insert_bio_sectors << 9;
>  
> - cache_bio->bi_end_io= request_endio;
> + cache_bio->bi_end_io= backing_request_endio;
>   cache_bio->bi_private   = &s->cl;
>  
>   bch_bio_map(cac

Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Logan Gunthorpe


On 14/03/18 06:16 AM, David Laight wrote:
> That surprises me (unless I missed something last time I read the spec).
> While P2P writes are relatively easy to handle, reads and any other TLP that
> require acks are a completely different proposition.
> There are no additional fields that can be set in the read TLP and will be
> reflected back in the ack(s) than can be used to route the acks back to the
> correct initiator.
>
> I'm pretty sure that to support P2P reads a switch would have to save
> the received read TLP and (possibly later on) issue read TLP of its own
> for the required data.
> I'm not even sure it is easy to interleave the P2P reads with those
> coming from the root.
> That requires a potentially infinite queue of pending requests.

This is wrong. A completion is a TLP just like any other and makes use
of the Destination ID field in the header to route it back to the
original requester.

> Some x86 root ports support P2P writes (maybe with a bios option).
> It would be a shame not to be able to do P2P writes on such systems
> even though P2P reads won't work.

Yes, and this has been discussed many times. It won't be changing in the
near term.

Logan


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Logan Gunthorpe


On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> I assume you want to exclude Root Ports because of multi-function
> devices and the "route to self" error.  I was hoping for a reference
> to that so I could learn more about it.

I haven't been able to find where in the spec it forbids route to self.
But I was told this by developers who work on switches. Hopefully
Stephen can find the reference.

But it's a bit of a moot point. Devices can DMA to themselves if they
are designed to do so. For example, some NVMe cards can read and write
their own CMB for certain types of DMA request. There is a register in
the spec (CMBSZ) which specifies which types of requests are supported.
(See 3.1.12 in NVMe 1.3a).
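
For reference, CMBSZ advertises the supported uses as individual flag
bits; this is from memory of NVMe 1.3a sec 3.1.12 (worth verifying
against the spec, and the macro names here are made up):

  #define CMBSZ_SQS   (1 << 0)  /* submission queues may live in the CMB */
  #define CMBSZ_CQS   (1 << 1)  /* completion queues may live in the CMB */
  #define CMBSZ_LISTS (1 << 2)  /* PRP/SGL lists may live in the CMB */
  #define CMBSZ_RDS   (1 << 3)  /* read data may be transferred via the CMB */
  #define CMBSZ_WDS   (1 << 4)  /* write data may be transferred via the CMB */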

> I agree that peers need to have a common upstream bridge.  I think
> you're saying peers need to have *two* common upstream bridges.  If I
> understand correctly, requiring two common bridges is a way to ensure
> that peers directly below Root Ports don't try to DMA to each other.

No, I don't get where you think we need to have two common upstream
bridges. I'm not sure when such a case would ever happen. But you seem
to understand based on what you wrote below.

> So I guess the first order of business is to nail down whether peers
> below a Root Port are prohibited from DMAing to each other.  My
> assumption, based on 6.12.1.2 and the fact that I haven't yet found
> a prohibition, is that they can.

If you have a multifunction device designed to DMA to itself below a
root port, it can. But determining this is on a device by device basis,
just as determining whether a root complex can do peer to peer is on a
per device basis. So I'd say we don't want to allow it by default and
let someone who has such a device figure out what's necessary if and
when one comes along.

> You already have upstream_bridges_match(), which takes two pci_devs.
> I think it should walk up the PCI hierarchy from the first device,
> checking whether the bridge at each level is also a parent of the
> second device.

Yes, this is what I meant when I said walking the entire tree. I've been
kicking the can down the road on implementing this, as getting the ref
counting right and testing it is going to be quite tricky. The
single-switch approach we implemented for now is a simplification that
works for a single switch. But I guess we can look at implementing it
this way for v4.
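
A minimal sketch of that walk (hypothetical helper; it ignores the ref
counting and locking concerns mentioned above):

  static struct pci_dev *common_upstream_bridge(struct pci_dev *a,
                                                struct pci_dev *b)
  {
          struct pci_dev *up_a, *up_b;

          /* For each bridge above 'a', check whether it is also above 'b'. */
          for (up_a = pci_upstream_bridge(a); up_a;
               up_a = pci_upstream_bridge(up_a)) {
                  for (up_b = pci_upstream_bridge(b); up_b;
                       up_b = pci_upstream_bridge(up_b)) {
                          if (up_a == up_b)
                                  return up_a;  /* common upstream bridge */
                  }
          }
          return NULL;  /* no common bridge: different hierarchy domains */
  }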

Logan


[PATCH v3] block: bio_check_eod() needs to consider partitions

2018-03-14 Thread Christoph Hellwig
bio_check_eod() should check the partition size, not the whole disk, if
bio->bi_partno is non-zero.  Do this by moving the call to bio_check_eod()
into blk_partition_remap().

Based on an earlier patch from Jiufei Xue.

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and 
partitions index")
Reported-by: Jiufei Xue 
Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 93 
 1 file changed, 40 insertions(+), 53 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f7fadd..47ee24611126 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2023,7 +2023,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
return BLK_QC_T_NONE;
 }
 
-static void handle_bad_sector(struct bio *bio)
+static void handle_bad_sector(struct bio *bio, sector_t maxsector)
 {
char b[BDEVNAME_SIZE];
 
@@ -2031,7 +2031,7 @@ static void handle_bad_sector(struct bio *bio)
printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n",
bio_devname(bio, b), bio->bi_opf,
(unsigned long long)bio_end_sector(bio),
-   (long long)get_capacity(bio->bi_disk));
+   (long long)maxsector);
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
@@ -2092,68 +2092,59 @@ static noinline int should_fail_bio(struct bio *bio)
 }
 ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
 
+/*
+ * Check whether this bio extends beyond the end of the device or partition.
+ * This may well happen - the kernel calls bread() without checking the size of
+ * the device, e.g., when mounting a file system.
+ */
+static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
+{
+   unsigned int nr_sectors = bio_sectors(bio);
+
+   if (nr_sectors && maxsector &&
+   (nr_sectors > maxsector ||
+bio->bi_iter.bi_sector > maxsector - nr_sectors)) {
+   handle_bad_sector(bio, maxsector);
+   return -EIO;
+   }
+   return 0;
+}
+
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
 static inline int blk_partition_remap(struct bio *bio)
 {
struct hd_struct *p;
-   int ret = 0;
+   int ret = -EIO;
 
rcu_read_lock();
p = __disk_get_part(bio->bi_disk, bio->bi_partno);
-   if (unlikely(!p || should_fail_request(p, bio->bi_iter.bi_size) ||
-bio_check_ro(bio, p))) {
-   ret = -EIO;
+   if (unlikely(!p))
+   goto out;
+   if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
+   goto out;
+   if (unlikely(bio_check_ro(bio, p)))
goto out;
-   }
 
/*
 * Zone reset does not include bi_size so bio_sectors() is always 0.
 * Include a test for the reset op code and perform the remap if needed.
 */
-   if (!bio_sectors(bio) && bio_op(bio) != REQ_OP_ZONE_RESET)
-   goto out;
-
-   bio->bi_iter.bi_sector += p->start_sect;
-   bio->bi_partno = 0;
-   trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
- bio->bi_iter.bi_sector - p->start_sect);
-
+   if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) {
+   if (bio_check_eod(bio, part_nr_sects_read(p)))
+   goto out;
+   bio->bi_iter.bi_sector += p->start_sect;
+   bio->bi_partno = 0;
+   trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
+ bio->bi_iter.bi_sector - p->start_sect);
+   }
+   ret = 0;
 out:
rcu_read_unlock();
return ret;
 }
 
-/*
- * Check whether this bio extends beyond the end of the device.
- */
-static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
-{
-   sector_t maxsector;
-
-   if (!nr_sectors)
-   return 0;
-
-   /* Test device or partition size, when known. */
-   maxsector = get_capacity(bio->bi_disk);
-   if (maxsector) {
-   sector_t sector = bio->bi_iter.bi_sector;
-
-   if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-   /*
-* This may well happen - the kernel calls bread()
-* without checking the size of the device, e.g., when
-* mounting a device.
-*/
-   handle_bad_sector(bio);
-   return 1;
-   }
-   }
-
-   return 0;
-}
-
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -2164,9 +2155,6 @@ generic_make_request_checks(struct bio *bio)
 
might_sleep();
 
-   if (bio_check_eod(bio, nr_sectors))
-   goto end_io;
-
q = bio->bi_disk->queue;
if (unlikely(!q)) {
printk(KERN_ERR
@@ -2186,17 +2174,16 @@ generic_make_request

Re: [PATCH] dm mpath: fix passing integrity data

2018-03-14 Thread Martin K. Petersen

Steffen,

> After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity
> data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
> block integrity any more. So add it to the whitelist.

Ugh.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] dm mpath: fix passing integrity data

2018-03-14 Thread Hannes Reinecke
On 03/14/2018 03:33 PM, Steffen Maier wrote:
> After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity
> data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
> block integrity any more. So add it to the whitelist.
> 
> This is also a pre-requisite to use block integrity with other dm layer(s)
> on top of multipath, such as kpartx partitions (dm-linear) or LVM.
> 
> Signed-off-by: Steffen Maier 
> Bisected-by: Fedor Loshakov 
> Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data")
> Cc:  #4.12+
> ---
>  drivers/md/dm-mpath.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 3fde9e9faddd..c174f0c53dc9 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -2023,7 +2023,8 @@ static int multipath_busy(struct dm_target *ti)
>  static struct target_type multipath_target = {
>   .name = "multipath",
>   .version = {1, 12, 0},
> - .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
> + .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE |
> + DM_TARGET_PASSES_INTEGRITY,
>   .module = THIS_MODULE,
>   .ctr = multipath_ctr,
>   .dtr = multipath_dtr,
> 
Ho-hum.
Thanks for this.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v2] block: bio_check_eod() needs to consider partition

2018-03-14 Thread Bart Van Assche
On Wed, 2018-03-14 at 14:03 +0100, h...@lst.de wrote:
> can you test the version below?

Hello Christoph,

The same VM that failed to boot with v2 of this patch boots fine with this
patch.

Thanks,

Bart.





Re: [PATCH V5 1/5] scsi: hpsa: fix selection of reply queue

2018-03-14 Thread Bityutskiy, Artem
On Tue, 2018-03-13 at 17:42 +0800, Ming Lei wrote:
> Since 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs"),
> an msix vector can be created without any online CPU mapped to it, so a
> command's completion may never be notified.
> 
> This patch sets up a mapping between CPUs and reply queues according to
> the irq affinity info retrieved by pci_irq_get_affinity(), and uses this
> mapping table to choose the reply queue for queuing a command.
> 
> The chosen reply queue is then guaranteed to be active, which fixes the
> IO hang caused by using an inactive reply queue that doesn't have any
> online CPU mapped.
> 
> Cc: Hannes Reinecke 
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Laurence Oberman 
> Cc: Meelis Roos 
> Cc: Artem Bityutskiy 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Tested-by: Don Brace 
> Tested-by: Artem Bityutskiy 
> Acked-by: Don Brace 
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> Signed-off-by: Ming Lei 

Checked v5 on my Skylake Xeon, and with this patch the regression that I
reported is fixed.

Tested-by: Artem Bityutskiy 
Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com
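
For context, the reply-queue mapping described in the commit message
amounts to something like this (a sketch with hypothetical structure and
field names, not the actual hpsa code):

  static void setup_reply_queue_map(struct ctlr_info *h, struct pci_dev *pdev)
  {
          const struct cpumask *mask;
          unsigned int queue, cpu;

          for (queue = 0; queue < h->nreply_queues; queue++) {
                  mask = pci_irq_get_affinity(pdev, queue);
                  if (!mask)
                          continue;
                  /* Every CPU in this vector's affinity mask completes
                   * through this (active) reply queue. */
                  for_each_cpu(cpu, mask)
                          h->reply_map[cpu] = queue;
          }
  }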


Re: [PATCH V5 2/5] scsi: megaraid_sas: fix selection of reply queue

2018-03-14 Thread Artem Bityutskiy
On Tue, 2018-03-13 at 17:42 +0800, Ming Lei wrote:
> Since 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs"),
> an msix vector can be created without any online CPU mapped to it, so a
> command may be queued but its completion never notified.
> 
> This patch sets up a mapping between CPUs and reply queues according to
> the irq affinity info retrieved by pci_irq_get_affinity(), and uses this
> info to choose the reply queue for queuing a command.
> 
> The chosen reply queue is then guaranteed to be active, which fixes the
> IO hang caused by using an inactive reply queue that doesn't have any
> online CPU mapped.
> 
> Cc: Hannes Reinecke 
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Laurence Oberman 
> Cc: Mike Snitzer 
> Cc: Meelis Roos 
> Cc: Artem Bityutskiy 
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> Signed-off-by: Ming Lei 

Checked v5 on my Skylake Xeon, and with this patch the regression that I
reported is fixed.

Tested-by: Artem Bityutskiy 
Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com


[PATCH 11/16] treewide: simplify Kconfig dependencies for removed archs

2018-03-14 Thread Arnd Bergmann
A lot of Kconfig symbols have architecture-specific dependencies.
Where those dependencies are on architectures we have already removed,
they can be omitted.

Signed-off-by: Arnd Bergmann 
---
 block/bounce.c   |  2 +-
 drivers/ide/Kconfig  |  2 +-
 drivers/ide/ide-generic.c| 12 +---
 drivers/input/joystick/analog.c  |  2 +-
 drivers/isdn/hisax/Kconfig   | 10 +-
 drivers/net/ethernet/davicom/Kconfig |  2 +-
 drivers/net/ethernet/smsc/Kconfig|  6 +++---
 drivers/net/wireless/cisco/Kconfig   |  2 +-
 drivers/pwm/Kconfig  |  2 +-
 drivers/rtc/Kconfig  |  2 +-
 drivers/spi/Kconfig  |  4 ++--
 drivers/usb/musb/Kconfig |  2 +-
 drivers/video/console/Kconfig|  3 +--
 drivers/watchdog/Kconfig |  6 --
 drivers/watchdog/Makefile|  6 --
 fs/Kconfig.binfmt|  5 ++---
 fs/minix/Kconfig |  2 +-
 include/linux/ide.h  |  7 +--
 init/Kconfig |  5 ++---
 lib/Kconfig.debug| 13 +
 lib/test_user_copy.c |  2 --
 mm/Kconfig   |  7 ---
 mm/percpu.c  |  4 
 23 files changed, 31 insertions(+), 77 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index 6a3e68292273..dd0b93f2a871 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -31,7 +31,7 @@
 static struct bio_set *bounce_bio_set, *bounce_bio_split;
 static mempool_t *page_pool, *isa_page_pool;
 
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_NEED_BOUNCE_POOL)
+#if defined(CONFIG_HIGHMEM)
 static __init int init_emergency_pool(void)
 {
 #if defined(CONFIG_HIGHMEM) && !defined(CONFIG_MEMORY_HOTPLUG)
diff --git a/drivers/ide/Kconfig b/drivers/ide/Kconfig
index cf1fb3fb5d26..901b8833847f 100644
--- a/drivers/ide/Kconfig
+++ b/drivers/ide/Kconfig
@@ -200,7 +200,7 @@ comment "IDE chipset support/bugfixes"
 
 config IDE_GENERIC
tristate "generic/default IDE chipset support"
-   depends on ALPHA || X86 || IA64 || M32R || MIPS || ARCH_RPC
+   depends on ALPHA || X86 || IA64 || MIPS || ARCH_RPC
default ARM && ARCH_RPC
help
  This is the generic IDE driver.  This driver attaches to the
diff --git a/drivers/ide/ide-generic.c b/drivers/ide/ide-generic.c
index 54d7c4685d23..80c0d69b83ac 100644
--- a/drivers/ide/ide-generic.c
+++ b/drivers/ide/ide-generic.c
@@ -13,13 +13,10 @@
 #include 
 #include 
 
-/* FIXME: convert arm and m32r to use ide_platform host driver */
+/* FIXME: convert arm to use ide_platform host driver */
 #ifdef CONFIG_ARM
 #include 
 #endif
-#ifdef CONFIG_M32R
-#include 
-#endif
 
 #define DRV_NAME   "ide_generic"
 
@@ -35,13 +32,6 @@ static const struct ide_port_info ide_generic_port_info = {
 #ifdef CONFIG_ARM
 static const u16 legacy_bases[] = { 0x1f0 };
 static const int legacy_irqs[]  = { IRQ_HARDDISK };
-#elif defined(CONFIG_PLAT_M32700UT) || defined(CONFIG_PLAT_MAPPI2) || \
-  defined(CONFIG_PLAT_OPSPUT)
-static const u16 legacy_bases[] = { 0x1f0 };
-static const int legacy_irqs[]  = { PLD_IRQ_CFIREQ };
-#elif defined(CONFIG_PLAT_MAPPI3)
-static const u16 legacy_bases[] = { 0x1f0, 0x170 };
-static const int legacy_irqs[]  = { PLD_IRQ_CFIREQ, PLD_IRQ_IDEIREQ };
 #elif defined(CONFIG_ALPHA)
 static const u16 legacy_bases[] = { 0x1f0, 0x170, 0x1e8, 0x168 };
 static const int legacy_irqs[]  = { 14, 15, 11, 10 };
diff --git a/drivers/input/joystick/analog.c b/drivers/input/joystick/analog.c
index be1b4921f22a..eefac7978f93 100644
--- a/drivers/input/joystick/analog.c
+++ b/drivers/input/joystick/analog.c
@@ -163,7 +163,7 @@ static unsigned int get_time_pit(void)
 #define GET_TIME(x)do { x = (unsigned int)rdtsc(); } while (0)
 #define DELTA(x,y) ((y)-(x))
 #define TIME_NAME  "TSC"
-#elif defined(__alpha__) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || 
defined(CONFIG_RISCV) || defined(CONFIG_TILE)
+#elif defined(__alpha__) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || 
defined(CONFIG_RISCV)
 #define GET_TIME(x)do { x = get_cycles(); } while (0)
 #define DELTA(x,y) ((y)-(x))
 #define TIME_NAME  "get_cycles"
diff --git a/drivers/isdn/hisax/Kconfig b/drivers/isdn/hisax/Kconfig
index eb83d94ab4fe..38cfc8baae19 100644
--- a/drivers/isdn/hisax/Kconfig
+++ b/drivers/isdn/hisax/Kconfig
@@ -109,7 +109,7 @@ config HISAX_16_3
 
 config HISAX_TELESPCI
bool "Teles PCI"
-   depends on PCI && (BROKEN || !(SPARC || PPC || PARISC || M68K || (MIPS 
&& !CPU_LITTLE_ENDIAN) || FRV || (XTENSA && !CPU_LITTLE_ENDIAN)))
+   depends on PCI && (BROKEN || !(SPARC || PPC || PARISC || M68K || (MIPS 
&& !CPU_LITTLE_ENDIAN) || (XTENSA && !CPU_LITTLE_ENDIAN)))
help
  This enables HiSax support for the Teles PCI.
  See  on how to configure it.
@@ -237,7 +237,7 @@ config HISAX_MIC
 
 config HISAX_NETJET
bool "NET

[PATCH 00/16] remove eight obsolete architectures

2018-03-14 Thread Arnd Bergmann
Here is the collection of patches I have applied to my 'asm-generic' tree
on top of the 'metag' removal. This does not include any of the device
drivers, I'll send those separately to a someone different list of people.

The removal came out of a discussion that is now documented at
https://lwn.net/Articles/748074/

Following up from the state described there, I ended up removing the
mn10300, tile, blackfin and cris architectures directly, rather than
waiting, after consulting with the respective maintainers.

However, the unicore32 architecture is no longer part of the removal,
after its maintainer Xuetao Guan said that the port is still actively
being used and that he intends to keep working on it, and that he will
try to provide updated toolchain sources.

In the end, it seems that while the eight architectures are extremely
different, they all suffered the same fate: There was one company in
charge of an SoC line, a CPU microarchitecture and a software ecosystem,
which was more costly than licensing newer off-the-shelf CPU cores from
a third party (typically ARM, MIPS, or RISC-V). It seems that all the
SoC product lines are still around, but have not used the custom CPU
architectures for several years at this point.

  Arnd

Arnd Bergmann (14):
  arch: remove frv port
  arch: remove m32r port
  arch: remove score port
  arch: remove blackfin port
  arch: remove tile port
  procfs: remove CONFIG_HARDWALL dependency
  mm: remove blackfin MPU support
  mm: remove obsolete alloc_remap()
  treewide: simplify Kconfig dependencies for removed archs
  asm-generic: siginfo: remove obsolete #ifdefs
  Documentation: arch-support: remove obsolete architectures
  asm-generic: clean up asm/unistd.h
  recordmcount.pl: drop blackfin and tile support
  ktest: remove obsolete architectures

David Howells (1):
  mn10300: Remove the architecture

Jesper Nilsson (1):
  CRIS: Drop support for the CRIS port

Dirstat only (full diffstat is over 100KB):

   6.3% arch/blackfin/mach-bf548/include/mach/
   4.5% arch/blackfin/mach-bf609/include/mach/
  26.3% arch/blackfin/
   4.1% arch/cris/arch-v32/
   5.6% arch/cris/include/arch-v32/arch/hwregs/iop/
   4.1% arch/cris/include/arch-v32/mach-a3/mach/hwregs/
   4.7% arch/cris/include/arch-v32/
   7.8% arch/cris/
   5.6% arch/frv/
   5.5% arch/m32r/
   7.0% arch/mn10300/
   7.6% arch/tile/include/
   6.4% arch/tile/kernel/
   0.0% Documentation/admin-guide/
   0.0% Documentation/blackfin/
   0.0% Documentation/cris/
   0.0% Documentation/devicetree/bindings/cris/
   0.0% Documentation/devicetree/bindings/interrupt-controller/
   2.8% Documentation/features/
   0.5% Documentation/frv/
   0.0% Documentation/ioctl/
   0.0% Documentation/mn10300/
   0.0% Documentation/
   0.0% block/
   0.0% crypto/
   0.0% drivers/ide/
   0.0% drivers/input/joystick/
   0.0% drivers/isdn/hisax/
   0.0% drivers/net/ethernet/davicom/
   0.0% drivers/net/ethernet/smsc/
   0.0% drivers/net/wireless/cisco/
   0.0% drivers/pci/
   0.0% drivers/pwm/
   0.0% drivers/rtc/
   0.0% drivers/spi/
   0.0% drivers/staging/speakup/
   0.0% drivers/usb/musb/
   0.0% drivers/video/console/
   0.0% drivers/watchdog/
   0.0% fs/minix/
   0.0% fs/proc/
   0.0% fs/
   0.0% include/asm-generic/
   0.0% include/linux/
   0.0% include/uapi/asm-generic/
   0.0% init/
   0.0% kernel/
   0.0% lib/
   0.0% mm/
   0.0% samples/blackfin/
   0.0% samples/kprobes/
   0.0% samples/
   0.0% scripts/mod/
   0.0% scripts/
   0.0% tools/arch/frv/include/uapi/asm/
   0.0% tools/arch/m32r/include/uapi/asm/
   0.0% tools/arch/mn10300/include/uapi/asm/
   0.0% tools/arch/score/include/uapi/asm/
   0.0% tools/arch/tile/include/asm/
   0.0% tools/arch/tile/include/uapi/asm/
   0.0% tools/include/asm-generic/
   0.0% tools/scripts/
   0.0% tools/testing/ktest/examples/
   0.0% tools/testing/ktest/

Cc: linux-...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-in...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: linux-wirel...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: dri-de...@lists.freedesktop.org
Cc: linux-fb...@vger.kernel.org
Cc: linux-watch...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org


[PATCH] dm mpath: fix passing integrity data

2018-03-14 Thread Steffen Maier
After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity
data"), dm-multipath no longer supports block integrity, e.g. on DIF+DIX
SCSI disk paths. Fix this by adding DM_TARGET_PASSES_INTEGRITY to the
multipath target's feature whitelist.

This is also a prerequisite for using block integrity with other dm layer(s)
on top of multipath, such as kpartx partitions (dm-linear) or LVM.
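
For reference, the dm core only keeps an integrity profile for a table if
every target in it advertises this feature. A rough sketch of that check
(approximate, not verbatim; the real logic lives around
dm_table_get_integrity_disk() in drivers/md/dm-table.c):

	static bool table_passes_integrity(struct dm_table *t)
	{
		unsigned int i;

		for (i = 0; i < dm_table_get_num_targets(t); i++) {
			struct dm_target *ti = dm_table_get_target(t, i);

			/* one opaque target drops integrity for the whole table */
			if (!(ti->type->features & DM_TARGET_PASSES_INTEGRITY))
				return false;
		}
		return true;
	}

Without the flag on multipath_target, any table containing a multipath
target therefore loses the integrity profile of the underlying paths.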

Signed-off-by: Steffen Maier 
Bisected-by: Fedor Loshakov 
Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data")
Cc:  #4.12+
---
 drivers/md/dm-mpath.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 3fde9e9faddd..c174f0c53dc9 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -2023,7 +2023,8 @@ static int multipath_busy(struct dm_target *ti)
 static struct target_type multipath_target = {
.name = "multipath",
.version = {1, 12, 0},
-   .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
+   .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE |
+   DM_TARGET_PASSES_INTEGRITY,
.module = THIS_MODULE,
.ctr = multipath_ctr,
.dtr = multipath_dtr,
-- 
2.13.5



Re: [PATCH v5] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-14 Thread Tejun Heo
Hello,

On Wed, Mar 14, 2018 at 02:18:04PM +0800, Joseph Qi wrote:
> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
> blkcg_bio_issue_check()")
> Reported-by: Jiufei Xue 
> Cc: sta...@vger.kernel.org #4.3+

I'm a bit nervous about tagging it for -stable.  Given the low rate of
this actually occurring, I'm not sure the benefits outweigh the risks.
Let's at least cook it for a couple releases before sending it to
-stable.

> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 69bea82..dccd102 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -88,6 +88,7 @@ struct blkg_policy_data {
>   /* the blkg and policy id this per-policy data belongs to */
>   struct blkcg_gq *blkg;
>   int plid;
> + boolofflined;
>  };

This is pure bike-shedding but offlined reads kinda weird to me, maybe
just offline would read better?  Other than that,

 Acked-by: Tejun Heo 

Thanks a lot for seeing this through.
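
As an aside, the way a flag like this typically gets consumed on the hot
path looks roughly like the following (an illustrative sketch only, with
a made-up helper name, not something this patch adds verbatim):

	/* called under rcu_read_lock(), e.g. from blkcg_bio_issue_check() */
	static inline bool blkg_pd_live(struct blkg_policy_data *pd)
	{
		/* set from the ->pd_offline_fn() path during cgroup_rmdir() */
		return pd && !pd->offline;
	}

A submitter racing with cgroup removal then either sees a live policy
data or one already marked offline, instead of operating on state that
is about to be torn down.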

-- 
tejun


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Stephen Bates
> I assume you want to exclude Root Ports because of multi-function
> devices and the "route to self" error.  I was hoping for a reference
> to that so I could learn more about it.

Apologies Bjorn. This slipped through my net. I will try and get you a 
reference for RTS in the next couple of days.

> While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
> Functions in SR-IOV Capable and Multi-Function Devices", which seems
> relevant.  It talks about "peer-to-peer Requests (between Functions of
> the device)".  Thay says to me that multi-function devices can DMA
> between themselves.

I will go take a look. Appreciate the link.

Stephen 



Re: [PATCH v2] block: bio_check_eod() needs to consider partition

2018-03-14 Thread h...@lst.de
Hi Bart,

can you test the version below?

---
From a68a8518158e31d66a0dc4f4e795ca3ceb83752c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Tue, 13 Mar 2018 09:27:30 +0100
Subject: block: bio_check_eod() needs to consider partition

bio_check_eod() should check the partition size, not the whole disk, if
bio->bi_partno is non-zero.  Do this by moving the call to bio_check_eod()
into blk_partition_remap().  For example, with a 1024-sector partition, a
bio for 16 sectors at partition-relative sector 1020 used to be checked
against the capacity of the whole disk and could pass, even though it runs
past the end of the partition; it is now checked against
part_nr_sects_read(p) and rejected with -EIO before the remap.

Based on an earlier patch from Jiufei Xue.

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Reported-by: Jiufei Xue 
Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 93 
 1 file changed, 40 insertions(+), 53 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f7fadd..47ee24611126 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2023,7 +2023,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
 }
 
-static void handle_bad_sector(struct bio *bio)
+static void handle_bad_sector(struct bio *bio, sector_t maxsector)
 {
char b[BDEVNAME_SIZE];
 
@@ -2031,7 +2031,7 @@ static void handle_bad_sector(struct bio *bio)
printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n",
bio_devname(bio, b), bio->bi_opf,
(unsigned long long)bio_end_sector(bio),
-   (long long)get_capacity(bio->bi_disk));
+   (long long)maxsector);
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
@@ -2092,68 +2092,59 @@ static noinline int should_fail_bio(struct bio *bio)
 }
 ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
 
+/*
+ * Check whether this bio extends beyond the end of the device or partition.
+ * This may well happen - the kernel calls bread() without checking the size of
+ * the device, e.g., when mounting a file system.
+ */
+static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
+{
+   unsigned int nr_sectors = bio_sectors(bio);
+
+   if (nr_sectors && maxsector &&
+   (nr_sectors > maxsector ||
+bio->bi_iter.bi_sector > maxsector - nr_sectors)) {
+   handle_bad_sector(bio, maxsector);
+   return -EIO;
+   }
+   return 0;
+}
+
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
 static inline int blk_partition_remap(struct bio *bio)
 {
struct hd_struct *p;
-   int ret = 0;
+   int ret = -EIO;
 
rcu_read_lock();
p = __disk_get_part(bio->bi_disk, bio->bi_partno);
-   if (unlikely(!p || should_fail_request(p, bio->bi_iter.bi_size) ||
-bio_check_ro(bio, p))) {
-   ret = -EIO;
+   if (unlikely(!p))
+   goto out;
+   if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
+   goto out;
+   if (unlikely(bio_check_ro(bio, p)))
goto out;
-   }
 
/*
 * Zone reset does not include bi_size so bio_sectors() is always 0.
 * Include a test for the reset op code and perform the remap if needed.
 */
-   if (!bio_sectors(bio) && bio_op(bio) != REQ_OP_ZONE_RESET)
-   goto out;
-
-   bio->bi_iter.bi_sector += p->start_sect;
-   bio->bi_partno = 0;
-   trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
- bio->bi_iter.bi_sector - p->start_sect);
-
+   if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) {
+   if (bio_check_eod(bio, part_nr_sects_read(p)))
+   goto out;
+   bio->bi_iter.bi_sector += p->start_sect;
+   bio->bi_partno = 0;
+   trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
+ bio->bi_iter.bi_sector - p->start_sect);
+   }
+   ret = 0;
 out:
rcu_read_unlock();
return ret;
 }
 
-/*
- * Check whether this bio extends beyond the end of the device.
- */
-static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
-{
-   sector_t maxsector;
-
-   if (!nr_sectors)
-   return 0;
-
-   /* Test device or partition size, when known. */
-   maxsector = get_capacity(bio->bi_disk);
-   if (maxsector) {
-   sector_t sector = bio->bi_iter.bi_sector;
-
-   if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-   /*
-* This may well happen - the kernel calls bread()
-* without checking the size of the device, e.g., when
-* mounting a device.
-*/
-   handle_bad_sector(bio);
-   return 1;
-   }
-   }
-
-   return 0;
-}
-
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -2164,9 +2155,6 @@ generic_make_request_checks(struct bio *b

RE: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread David Laight
From: Logan Gunthorpe
> Sent: 13 March 2018 23:46
...
> As Stephen pointed out, it's a requirement of the PCIe spec that a
> switch supports P2P. If you want to sell a switch that does P2P with bad
> performance then that's on you to deal with.

That surprises me (unless I missed something last time I read the spec).
While P2P writes are relatively easy to handle, reads and any other TLP that
require acks are a completely different proposition.
There are no additional fields that can be set in the read TLP and
reflected back in the ack(s) that could be used to route the acks back
to the correct initiator.

I'm pretty sure that to support P2P reads a switch would have to save
the received read TLP and (possibly later on) issue a read TLP of its own
for the required data.
I'm not even sure it is easy to interleave the P2P reads with those
coming from the root.
That requires a potentially infinite queue of pending requests.

Some x86 root ports support P2P writes (maybe with a bios option).
It would be a shame not to be able to do P2P writes on such systems
even though P2P reads won't work.

(We looked at using P2P transfers for some data, but in the end used
a different scheme.
For our use case P2P writes were enough.
An alternative would be to access the same host memory buffer from
two different devices - but there isn't an API that lets you do that.)

David




Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-14 Thread Dou Liyang

Hi Artem,

At 03/14/2018 05:07 PM, Artem Bityutskiy wrote:
> On Wed, 2018-03-14 at 12:11 +0800, Dou Liyang wrote:
> > > At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote:
> > > > On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy
> > > > > Longer term, yeah, I agree. Kernel's notion of possible CPU count
> > > > > should be realistic.
> > > 
> > > I did a patch for that, Artem, could you help me to test it.
> > 
> > I didn't consider the nr_cpu_ids before. please ignore the old patch
> > and try the following RFC patch.
> 
> Sure I can help with testing a patch, could we please:
> 
> 1. Start a new thread for this
> 2. Include ACPI forum/folks

OK, I will do that right now.

Thanks,
dou

> Thanks,
> Artem.


Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-03-14 Thread Michal Hocko
On Tue 06-03-18 20:28:59, Tetsuo Handa wrote:
> Laura Abbott wrote:
> > On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > >> Hi,
> > >>
> > >> The Fedora arm-32 build VMs have a somewhat long standing problem
> > >> of hanging when running mkfs.ext4 with a bunch of processes stuck
> > >> in D state. This has been seen as far back as 4.13 but is still
> > >> present on 4.14:
> > >>
> > > [...]
> > >> This looks like everything is blocked on the writeback completing but
> > >> the writeback has been throttled. According to the infra team, this 
> > >> problem
> > >> is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > >> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > >> quite match since this seems to be completely stuck. Any suggestions to
> > >> narrow the problem down?
> > > 
> > > How much dirtyable memory does the system have? We do allow only lowmem
> > > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > > highmem_is_dirtyable?
> > > 
> > 
> > Setting highmem_is_dirtyable did fix the problem. The infrastructure
> > people seemed satisfied enough with this (and are happy to have the
> > machines back).
> 
> That's good.
> 
> > I'll see if they are willing to run a few more tests
> > to get some more state information.
> 
> Well, I'm far from understanding what is happening in your case, but I'm
> interested in other threads which were trying to allocate memory. Therefore,
> I appreciate if they can take SysRq-m + SysRq-t than SysRq-w (as described
> at http://akari.osdn.jp/capturing-kernel-messages.html ).
> 
> Code which assumes that kswapd can make progress can get stuck when kswapd
> is blocked somewhere. And wbt_wait() seems to change behavior based on
> current_is_kswapd(). If everyone is waiting for kswapd but kswapd cannot
> make progress, I worry that it leads to hangups like your case.

Tetsuo, could you stop this finally, pretty please? This is a
well known limitation of 32b architectures with more than 4G. The lowmem
can only handle 896MB of memory and that can be filled up with other
kernel allocations. Stalled writeback is _usually_ a result of only
little dirtyable memory which is left in the lowmem. We cannot simply
allow highmem to be dirtyable by default due to reasons explained in
other email.
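
For reference, the default policy amounts to roughly this (a paraphrased
sketch; the helper names are made up, but vm_highmem_is_dirtyable is the
real knob, and the real logic lives in global_dirtyable_memory() in
mm/page-writeback.c):

	static unsigned long dirtyable_memory(void)
	{
		unsigned long x = free_pages() + reclaimable_pages();

		/* by default only lowmem counts on 32b highmem systems */
		if (!vm_highmem_is_dirtyable)
			x -= highmem_pages();

		return x;
	}

Once kernel allocations have eaten most of the ~896MB of lowmem, almost
nothing is left to dirty, and writers are throttled very aggressively.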

I can imagine that it is hard for you to grasp that not everything is
"silent hang during OOM" but there are other things going on in the VM.
-- 
Michal Hocko
SUSE Labs


Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem

2018-03-14 Thread Michal Hocko
On Mon 05-03-18 13:04:24, Laura Abbott wrote:
> On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > > Hi,
> > > 
> > > The Fedora arm-32 build VMs have a somewhat long standing problem
> > > of hanging when running mkfs.ext4 with a bunch of processes stuck
> > > in D state. This has been seen as far back as 4.13 but is still
> > > present on 4.14:
> > > 
> > [...]
> > > This looks like everything is blocked on the writeback completing but
> > > the writeback has been throttled. According to the infra team, this 
> > > problem
> > > is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > > https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > > quite match since this seems to be completely stuck. Any suggestions to
> > > narrow the problem down?
> > 
> > How much dirtyable memory does the system have? We do allow only lowmem
> > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > highmem_is_dirtyable?
> > 
> 
> Setting highmem_is_dirtyable did fix the problem. The infrastructure
> people seemed satisfied enough with this (and are happy to have the
> machines back). I'll see if they are willing to run a few more tests
> to get some more state information.

Please be aware that highmem_is_dirtyable is not for free. There are
some code paths which can only allocate from lowmem (e.g. block device
AFAIR) and those could fill up the whole lowmem without any throttling.
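
Concretely, the distinction looks like this (illustrative fragment only):

	size_t size = 4096;
	/* slab memory must be directly mapped: this can never come from highmem */
	void *buf = kmalloc(size, GFP_KERNEL);
	/* page cache and user pages are allowed to sit in highmem */
	struct page *page = alloc_page(GFP_HIGHUSER_MOVABLE);

Paths stuck with the first kind of allocation compete for the same small
lowmem pool that dirty throttling would otherwise protect.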
-- 
Michal Hocko
SUSE Labs


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-14 Thread Artem Bityutskiy
On Wed, 2018-03-14 at 12:11 +0800, Dou Liyang wrote:
> > At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote:
> > > On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy 
> > > > Longer term, yeah, I agree. Kernel's notion of possible CPU
> > > > count
> > > > should be realistic.
> > 
> > I did a patch for that, Artem, could you help me to test it.
> > 
> 
> I didn't consider the nr_cpu_ids before. please ignore the old patch
> and
> try the following RFC patch.

Sure I can help with testing a patch, could we please:

1. Start a new thread for this
2. Include ACPI forum/folks

Thanks,
Artem.


Re: [PATCH V5 5/5] scsi: virtio_scsi: unify scsi_host_template

2018-03-14 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH V5 4/5] scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity

2018-03-14 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH V5 1/5] scsi: hpsa: fix selection of reply queue

2018-03-14 Thread Christoph Hellwig
I still don't like the code duplication, but I guess I can fix this
up in one of the next merge windows myself..

Reviewed-by: Christoph Hellwig 


Re: [PATCH V5 2/5] scsi: megaraid_sas: fix selection of reply queue

2018-03-14 Thread Christoph Hellwig
Same as for hpsa..

Reviewed-by: Christoph Hellwig