Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On Wed, Mar 14, 2018 at 12:34 PM, Stephen Bates wrote:
>> P2P over PCI/PCI-X is quite common in devices like raid controllers.
>
> Hi Dan
>
> Do you mean between PCIe devices below the RAID controller? Isn't it pretty
> novel to be able to support PCIe EPs below a RAID controller (as opposed to
> SCSI based devices)?

I'm thinking of the classic I/O offload card where there's an NTB to an
internal PCI bus that has a storage controller and raid offload engines.
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
Stephen,

>> It would be useful if those configurations were not left behind so
>> that Linux could feasibly deploy offload code to a controller in the
>> PCI domain.
>
> Agreed. I think this would be great. Kind of like the XCOPY framework
> that was proposed a while back for SCSI devices [1] but updated to also
> include NVMe devices. That is definitely a use case we would like this
> framework to support.

I'm on my umpteenth rewrite of the block/SCSI offload code. It is not as
protocol-agnostic as I would like in the block layer facing downwards. It
has proven quite hard to reconcile token-based and EXTENDED COPY semantics
along with the desire to support stacking.

But from an application/filesystem perspective everything looks the same
regardless of the intricacies of the device. Nothing is preventing us from
supporting other protocols...

--
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH V5 0/5] SCSI: fix selection of reply(hw) queue
Ming,

> The patches fixes reply queue(virt-queue on virtio-scsi) selection on
> hpsa, megaraid_sa and virtio-scsi, and IO hang can be caused easily by
> this issue.

I clarified all the commit descriptions. There were also a bunch of
duplicate review tags and other warnings. Please run checkpatch next time!

Applied to 4.16/scsi-fixes. Thank you.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH v5] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()
Hello Tejun,

Thanks for your quick response.

On 18/3/14 22:09, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 14, 2018 at 02:18:04PM +0800, Joseph Qi wrote:
>> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in
>> blkcg_bio_issue_check()")
>> Reported-by: Jiufei Xue
>> Cc: sta...@vger.kernel.org #4.3+
>
> I'm a bit nervous about tagging it for -stable. Given the low rate of
> this actually occurring, I'm not sure the benefits outweigh the risks.
> Let's at least cook it for a couple releases before sending it to
> -stable.
>
>> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
>> index 69bea82..dccd102 100644
>> --- a/include/linux/blk-cgroup.h
>> +++ b/include/linux/blk-cgroup.h
>> @@ -88,6 +88,7 @@ struct blkg_policy_data {
>>  	/* the blkg and policy id this per-policy data belongs to */
>>  	struct blkcg_gq		*blkg;
>>  	int			plid;
>> +	bool			offlined;
>>  };
>
> This is pure bike-shedding but offlined reads kinda weird to me, maybe
> just offline would read better? Other than that,

Do I need to resend a new version for this?

Thanks,
Joseph

> Acked-by: Tejun Heo
>
> Thanks a lot for seeing this through.
[PATCH v3, resend] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
It happens often while I'm preparing a patch for a block driver that I'm
wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT available to
this driver? Do I have to introduce definitions of these constants before
I can use them? To avoid this confusion, move the existing definitions of
SECTOR_SIZE and SECTOR_SHIFT into the block layer header file such that
they become available to all block drivers. Make the SECTOR_SIZE
definition in the uapi msdos_fs.h header file conditional so that
including it after the block layer header no longer causes the compiler
to complain about a SECTOR_SIZE redefinition.

Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have not
been removed from uapi header files nor from NAND drivers in which these
constants are used for purposes other than converting block layer offsets
and sizes into a number of sectors.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Martin K. Petersen
Cc: Sergey Senozhatsky
Cc: David S. Miller
Cc: Mike Snitzer
Cc: Dan Williams
Cc: Minchan Kim
Cc: Nitin Gupta
---
Changes compared to v2:
- Updated Reviewed-by tags.

Changes compared to v1:
- Changed enums into defines.
- Defined SECTOR_SIZE in terms of SECTOR_SHIFT.
- Made uapi SECTOR_SIZE definition conditional.
 arch/xtensa/platforms/iss/simdisk.c |  1 -
 drivers/block/brd.c                 |  1 -
 drivers/block/null_blk.c            |  2 --
 drivers/block/rbd.c                 |  9
 drivers/block/zram/zram_drv.h       |  1 -
 drivers/ide/ide-cd.c                |  8 +++
 drivers/ide/ide-cd.h                |  6 +-
 drivers/nvdimm/nd.h                 |  1 -
 drivers/scsi/gdth.h                 |  3 ---
 include/linux/blkdev.h              | 42 +++++--
 include/linux/device-mapper.h       |  2 --
 include/linux/ide.h                 |  1 -
 include/uapi/linux/msdos_fs.h       |  2 ++
 13 files changed, 38 insertions(+), 41 deletions(-)

diff --git a/arch/xtensa/platforms/iss/simdisk.c b/arch/xtensa/platforms/iss/simdisk.c
index 1b6418407467..026211e7ab09 100644
--- a/arch/xtensa/platforms/iss/simdisk.c
+++ b/arch/xtensa/platforms/iss/simdisk.c
@@ -21,7 +21,6 @@
 #include

 #define SIMDISK_MAJOR 240
-#define SECTOR_SHIFT 9
 #define SIMDISK_MINORS 1
 #define MAX_SIMDISK_COUNT 10

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index deea78e485da..66cb0f857f64 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -24,7 +24,6 @@
 #include

-#define SECTOR_SHIFT		9
 #define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
 #define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 0517613afccb..a76553293a31 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -16,10 +16,8 @@
 #include
 #include

-#define SECTOR_SHIFT		9
 #define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
 #define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
-#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
 #define SECTOR_MASK		(PAGE_SECTORS - 1)

 #define FREE_BATCH		16

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0016170cde0a..1e03b04819c8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -50,15 +50,6 @@

 #define RBD_DEBUG	/* Activate rbd_assert() calls */

-/*
- * The basic unit of block I/O is a sector. It is interpreted in a
- * number of contexts in Linux (blk, bio, genhd), but the default is
- * universally 512 bytes. These symbols are just slightly more
- * meaningful than the bare numbers they represent.
- */
-#define	SECTOR_SHIFT	9
-#define	SECTOR_SIZE	(1ULL << SECTOR_SHIFT)
-
 /*
  * Increment the given counter and return its updated value.
  * If the counter is already 0 it will not be incremented.

diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 31762db861e3..1e9bf65c0bfb 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -37,7 +37,6 @@ static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;

 /*-- End of configurable params */

-#define SECTOR_SHIFT		9
 #define SECTORS_PER_PAGE_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
 #define SECTORS_PER_PAGE	(1 << SECTORS_PER_PAGE_SHIFT)
 #define ZRAM_LOGICAL_BLOCK_SHIFT 12

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 5613cc2d51fc..5a8e8e3c22cd 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -712,7 +712,7 @@ static ide_startstop_t cdrom_start_rw(ide_drive_t *drive, struct request *rq)
 	struct request_queue *q = drive->queue;
 	int write = rq_data_dir(rq) == WRITE;
 	unsigned short sectors_per_frame =
-		queue_logical_block_size(q) >> SECTOR_BITS;
+		queue_logical_block_size(q) >> SECTOR_SHIFT;

 	ide_debug_log(IDE_DBG_RQ, "rq->cmd[0]: 0x%x, rq->cmd_flags: 0x%x, "
		      "secs_per_fra
Re: dm mpath: fix passing integrity data
On Wed, Mar 14 2018 at 10:33am -0400, Steffen Maier wrote:

> After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity
> data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
> block integrity any more. So add it to the whitelist.
>
> This is also a pre-requisite to use block integrity with other dm layer(s)
> on top of multipath, such as kpartx partitions (dm-linear) or LVM.
>
> Signed-off-by: Steffen Maier
> Bisected-by: Fedor Loshakov
> Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data")
> Cc: #4.12+
> ---
>  drivers/md/dm-mpath.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 3fde9e9faddd..c174f0c53dc9 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -2023,7 +2023,8 @@ static int multipath_busy(struct dm_target *ti)
>  static struct target_type multipath_target = {
>  	.name = "multipath",
>  	.version = {1, 12, 0},
> -	.features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
> +	.features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE |
> +		    DM_TARGET_PASSES_INTEGRITY,
>  	.module = THIS_MODULE,
>  	.ctr = multipath_ctr,
>  	.dtr = multipath_dtr,

Thanks, I've queued this for 4.16-rc6, will send to Linus tomorrow.
Re: [PATCH 8/8] block: sed-opal: ioctl for writing to shadow mbr
Hi Jonas,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20180309]
[cannot apply to linus/master v4.16-rc4 v4.16-rc3 v4.16-rc2 v4.16-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Jonas-Rabenstein/block-sed-opal-support-write-to-shadow-mbr/20180314-184749
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__

sparse warnings: (new ones prefixed by >>)

   block/sed-opal.c:381:20: sparse: incorrect type in assignment (different base types)
   block/sed-opal.c:381:20:    expected unsigned long long [unsigned] [usertype] align
   block/sed-opal.c:381:20:    got restricted __be64 const [usertype] alignment_granularity
   block/sed-opal.c:382:25: sparse: incorrect type in assignment (different base types)
   block/sed-opal.c:382:25:    expected unsigned long long [unsigned] [usertype] lowest_lba
   block/sed-opal.c:382:25:    got restricted __be64 const [usertype] lowest_aligned_lba
>> block/sed-opal.c:1526:58: sparse: incorrect type in argument 2 (different address spaces)
   block/sed-opal.c:1526:58:    expected void const [noderef] *from
   block/sed-opal.c:1526:58:    got unsigned char const [usertype] *
>> block/sed-opal.c:2100:14: sparse: incorrect type in argument 1 (different address spaces)
   block/sed-opal.c:2100:14:    expected void const volatile [noderef] *
   block/sed-opal.c:2100:14:    got unsigned char const [usertype] *data

vim +1526 block/sed-opal.c

  1493	
  1494	static int write_shadow_mbr(struct opal_dev *dev, void *data)
  1495	{
  1496		struct opal_shadow_mbr *shadow = data;
  1497		size_t off;
  1498		u64 len;
  1499		int err = 0;
  1500		u8 *payload;
  1501	
  1502		/* FIXME: this is the maximum we can use for IO_BUFFER_LENGTH=2048.
  1503		 *        Instead of having constant, it would be nice to compute the
  1504		 *        actual value depending on IO_BUFFER_LENGTH
  1505		 */
  1506		len = 1950;
  1507	
  1508		/* do the actual transmission(s) */
  1509		for (off = 0 ; off < shadow->size; off += len) {
  1510			len = min(len, shadow->size - off);
  1511	
  1512			pr_debug("MBR: write bytes %zu+%llu/%llu\n",
  1513				 off, len, shadow->size);
  1514			err = start_opal_cmd(dev, opaluid[OPAL_MBR],
  1515					     opalmethod[OPAL_SET]);
  1516			add_token_u8(&err, dev, OPAL_STARTNAME);
  1517			add_token_u8(&err, dev, OPAL_WHERE);
  1518			add_token_u64(&err, dev, shadow->offset + off);
  1519			add_token_u8(&err, dev, OPAL_ENDNAME);
  1520	
  1521			add_token_u8(&err, dev, OPAL_STARTNAME);
  1522			add_token_u8(&err, dev, OPAL_VALUES);
  1523			payload = add_bytestring_header(&err, dev, len);
  1524			if (!payload)
  1525				break;
> 1526			if (copy_from_user(payload, shadow->data + off, len))
  1527				err = -EFAULT;
  1528	
  1529			add_token_u8(&err, dev, OPAL_ENDNAME);
  1530			if (err)
  1531				break;
  1532	
  1533			err = finalize_and_send(dev, parse_and_check_status);
  1534			if (err)
  1535				break;
  1536		}
  1537		return err;
  1538	}
  1539	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
> P2P over PCI/PCI-X is quite common in devices like raid controllers.

Hi Dan

Do you mean between PCIe devices below the RAID controller? Isn't it pretty
novel to be able to support PCIe EPs below a RAID controller (as opposed to
SCSI based devices)?

> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.

Agreed. I think this would be great. Kind of like the XCOPY framework
that was proposed a while back for SCSI devices [1] but updated to also
include NVMe devices. That is definitely a use case we would like this
framework to support.

Stephen

[1] https://lwn.net/Articles/592094/
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On 14/03/18 01:28 PM, Dan Williams wrote:
> P2P over PCI/PCI-X is quite common in devices like raid controllers.
> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.

Thanks for the note. Neat. In the end, nothing is getting left behind;
it's just work for someone to add support. Even if I wasn't already going
to make the change I mentioned, it all fits into the architecture and
APIs quite easily.

Logan
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On Wed, Mar 14, 2018 at 12:03 PM, Logan Gunthorpe wrote:
> On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
>> You are focused on PCIe systems, and in those systems, most topologies
>> do have an upstream switch, which means two upstream bridges. I'm
>> trying to remove that assumption because I don't think there's a
>> requirement for it in the spec. Enforcing this assumption complicates
>> the code and makes it harder to understand because the reader says
>> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
>> so why do we need these two bridges?"
>
> Yes, as I've said, we focused on being behind a single PCIe Switch
> because it's easier and vaguely safer (we *know* switches will work but
> other types of topology we have to assume will work based on the spec).
> Also, I have my doubts that anyone will ever have a use for this with
> non-PCIe devices.

P2P over PCI/PCI-X is quite common in devices like raid controllers.
It would be useful if those configurations were not left behind so
that Linux could feasibly deploy offload code to a controller in the
PCI domain.
Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev
On 02/27/2018 08:55 AM, Coly Li wrote:
> If a bcache device is configured to writeback mode, current code does not
> handle write I/O errors on backing devices properly.
>
> In writeback mode, write request is written to cache device, and
> latter being flushed to backing device. If I/O failed when writing from
> cache device to the backing device, bcache code just ignores the error and
> upper layer code is NOT noticed that the backing device is broken.

lgtm, applied
Re: [PATCH v7 7/9] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O
LGTM, applied

(sorry if this is duplicated, had mail client problems)

On 02/27/2018 08:55 AM, Coly Li wrote:
> In order to catch I/O error of backing device, a separate bi_end_io
> call back is required. Then a per backing device counter can record I/O
> errors number and retire the backing device if the counter reaches a
> per backing device I/O error limit.
>
> This patch adds backing_request_endio() to bcache backing device I/O code
> path, this is a preparation for further complicated backing device failure
> handling. So far there is no real code logic change, I make this change a
> separate patch to make sure it is stable and reliable for further work.
>
> Changelog:
> v2: Fix code comments typo, remove a redundant bch_writeback_add() line
>     added in v4 patch set.
> v1: indeed this is new added in this patch set.
>
> Signed-off-by: Coly Li
> Reviewed-by: Hannes Reinecke
> Cc: Junhui Tang
> Cc: Michael Lyle
> ---
>  drivers/md/bcache/request.c   | 93 +++
>  drivers/md/bcache/super.c     |  1 +
>  drivers/md/bcache/writeback.c |  1 +
>  3 files changed, 79 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 279c9266bf50..0c517dd806a5 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl)
>  	}
>
>  	op->insert_data_done = true;
> +	/* get in bch_data_insert() */
>  	bio_put(bio);
>  out:
>  	continue_at(cl, bch_data_insert_keys, op->wq);
> @@ -630,6 +631,38 @@ static void request_endio(struct bio *bio)
>  	closure_put(cl);
>  }
>
> +static void backing_request_endio(struct bio *bio)
> +{
> +	struct closure *cl = bio->bi_private;
> +
> +	if (bio->bi_status) {
> +		struct search *s = container_of(cl, struct search, cl);
> +		/*
> +		 * If a bio has REQ_PREFLUSH for writeback mode, it is
> +		 * speically assembled in cached_dev_write() for a non-zero
> +		 * write request which has REQ_PREFLUSH. we don't set
> +		 * s->iop.status by this failure, the status will be decided
> +		 * by result of bch_data_insert() operation.
> +		 */
> +		if (unlikely(s->iop.writeback &&
> +			     bio->bi_opf & REQ_PREFLUSH)) {
> +			char buf[BDEVNAME_SIZE];
> +
> +			bio_devname(bio, buf);
> +			pr_err("Can't flush %s: returned bi_status %i",
> +			       buf, bio->bi_status);
> +		} else {
> +			/* set to orig_bio->bi_status in bio_complete() */
> +			s->iop.status = bio->bi_status;
> +		}
> +		s->recoverable = false;
> +		/* should count I/O error for backing device here */
> +	}
> +
> +	bio_put(bio);
> +	closure_put(cl);
> +}
> +
>  static void bio_complete(struct search *s)
>  {
>  	if (s->orig_bio) {
> @@ -644,13 +677,21 @@ static void bio_complete(struct search *s)
>  	}
>  }
>
> -static void do_bio_hook(struct search *s, struct bio *orig_bio)
> +static void do_bio_hook(struct search *s,
> +			struct bio *orig_bio,
> +			bio_end_io_t *end_io_fn)
>  {
>  	struct bio *bio = &s->bio.bio;
>
>  	bio_init(bio, NULL, 0);
>  	__bio_clone_fast(bio, orig_bio);
> -	bio->bi_end_io = request_endio;
> +	/*
> +	 * bi_end_io can be set separately somewhere else, e.g. the
> +	 * variants in,
> +	 * - cache_bio->bi_end_io from cached_dev_cache_miss()
> +	 * - n->bi_end_io from cache_lookup_fn()
> +	 */
> +	bio->bi_end_io = end_io_fn;
>  	bio->bi_private = &s->cl;
>
>  	bio_cnt_set(bio, 3);
> @@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio,
>  	s = mempool_alloc(d->c->search, GFP_NOIO);
>
>  	closure_init(&s->cl, NULL);
> -	do_bio_hook(s, bio);
> +	do_bio_hook(s, bio, request_endio);
>
>  	s->orig_bio = bio;
>  	s->cache_miss = NULL;
> @@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl)
>  	trace_bcache_read_retry(s->orig_bio);
>
>  	s->iop.status = 0;
> -	do_bio_hook(s, s->orig_bio);
> +	do_bio_hook(s, s->orig_bio, backing_request_endio);
>
>  	/* XXX: invalidate cache */
>
> +	/* I/O request sent to backing device */
>  	closure_bio_submit(s->iop.c, bio, cl);
>  }
>
> @@ -859,7 +901,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  	bio_copy_dev(cache_bio, miss);
>  	cache_bio->bi_iter.bi_size = s->insert_bio_sectors << 9;
>
> -	cache_bio->bi_end_io	= request_endio;
> +	cache_bio->bi_end_io	= backing_request_endio;
>  	cache_
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
> You are focused on PCIe systems, and in those systems, most topologies
> do have an upstream switch, which means two upstream bridges. I'm
> trying to remove that assumption because I don't think there's a
> requirement for it in the spec. Enforcing this assumption complicates
> the code and makes it harder to understand because the reader says
> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
> so why do we need these two bridges?"

Yes, as I've said, we focused on being behind a single PCIe Switch
because it's easier and vaguely safer (we *know* switches will work but
other types of topology we have to assume will work based on the spec).
Also, I have my doubts that anyone will ever have a use for this with
non-PCIe devices.

A switch shows up as two or more virtual bridges (per the PCIe v4 Spec,
sec 1.3.3), which explains the existing get_upstream_bridge_port()
function.

In any case, we'll look at generalizing this by looking for a common
upstream port in the next revision of the patch set.

Logan
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On Wed, Mar 14, 2018 at 10:17:34AM -0600, Logan Gunthorpe wrote:
> On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> > I agree that peers need to have a common upstream bridge. I think
> > you're saying peers need to have *two* common upstream bridges. If I
> > understand correctly, requiring two common bridges is a way to ensure
> > that peers directly below Root Ports don't try to DMA to each other.
>
> No, I don't get where you think we need to have two common upstream
> bridges. I'm not sure when such a case would ever happen. But you seem
> to understand based on what you wrote below.

Sorry, I phrased that wrong. You don't require two common upstream
bridges; you require two upstream bridges, with the upper one being
common, i.e.,

  static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
  {
          struct pci_dev *up1, *up2;

          up1 = pci_dev_get(pci_upstream_bridge(pdev));
          up2 = pci_dev_get(pci_upstream_bridge(up1));
          return up2;
  }

So if you're starting with pdev, up1 is the immediately upstream bridge
and up2 is the second upstream bridge. If this is PCIe, up1 may be a Root
Port and there is no up2, or up1 and up2 are in a switch.

This is more restrictive than the spec requires. As long as there is a
single common upstream bridge, peer-to-peer DMA should work. In fact, in
conventional PCI, I think the upstream bridge could even be the host
bridge (not a PCI-to-PCI bridge).

You are focused on PCIe systems, and in those systems, most topologies
do have an upstream switch, which means two upstream bridges. I'm trying
to remove that assumption because I don't think there's a requirement
for it in the spec. Enforcing this assumption complicates the code and
makes it harder to understand because the reader says "huh, I know
peer-to-peer DMA should work inside any PCI hierarchy*, so why do we
need these two bridges?"

[*] For conventional PCI, this means anything below the same host
bridge. Two devices on a conventional PCI root bus should be able to DMA
to each other, even though there's no PCI-to-PCI bridge above them. For
PCIe, it means a "hierarchy domain" as used in PCIe r4.0, sec 1.3.1,
i.e., anything below the same Root Port.

> > So I guess the first order of business is to nail down whether peers
> > below a Root Port are prohibited from DMAing to each other. My
> > assumption, based on 6.12.1.2 and the fact that I haven't yet found
> > a prohibition, is that they can.
>
> If you have a multifunction device designed to DMA to itself below a
> root port, it can. But determining this is on a device by device basis,
> just as determining whether a root complex can do peer to peer is on a
> per device basis. So I'd say we don't want to allow it by default and
> let someone who has such a device figure out what's necessary if and
> when one comes along.

It's not the job of this infrastructure to answer the device-dependent
question of whether DMA initiators or targets support peer-to-peer DMA.
All we want to do here is figure out whether the PCI topology supports
it, using the mechanisms guaranteed by the spec. We can derive that from
the basic rules about how PCI bridges work, i.e., from the PCI-to-PCI
Bridge spec r1.2, sec 4.3:

  A bridge forwards PCI memory transactions from its primary interface
  to its secondary interface (downstream) if a memory address is in the
  range defined by the Memory Base and Memory Limit registers (when the
  base is less than or equal to the limit) as illustrated in Figure 4-3.
  Conversely, a memory transaction on the secondary interface that is
  within this address range will not be forwarded upstream to the
  primary interface. Any memory transactions on the secondary interface
  that are outside this address range will be forwarded upstream to the
  primary interface (provided they are not in the address range defined
  by the prefetchable memory address range registers).

This works for either PCI or PCIe. The only wrinkle PCIe adds is that
the very top of the hierarchy is a Root Port, and we can't rely on it to
route traffic to other Root Ports. I also doubt Root Complex Integrated
Endpoints can participate in peer-to-peer DMA.

Thanks for your patience in working through all this. I know it
sometimes feels like being bounced around in all directions. It's just a
normal consequence of trying to add complex functionality to an already
complex system, with interest and expertise spread unevenly across a
crowd of people.

Bjorn
Re: [PATCH v7 8/9] bcache: add io_disable to struct cached_dev
LGTM, applying.

On 02/27/2018 08:55 AM, Coly Li wrote:
> If a bcache device is configured to writeback mode, current code does not
> handle write I/O errors on backing devices properly.
>
> In writeback mode, write request is written to cache device, and
> latter being flushed to backing device. If I/O failed when writing from
> cache device to the backing device, bcache code just ignores the error and
> upper layer code is NOT noticed that the backing device is broken.
>
> This patch tries to handle backing device failure like how the cache device
> failure is handled,
> - Add a error counter 'io_errors' and error limit 'error_limit' in struct
>   cached_dev. Add another io_disable to struct cached_dev to disable I/Os
>   on the problematic backing device.
> - When I/O error happens on backing device, increase io_errors counter. And
>   if io_errors reaches error_limit, set cache_dev->io_disable to true, and
>   stop the bcache device.
>
> The result is, if backing device is broken of disconnected, and I/O errors
> reach its error limit, backing device will be disabled and the associated
> bcache device will be removed from system.
>
> Changelog:
> v2: remove "bcache: " prefix in pr_error(), and use correct name string to
>     print out bcache device gendisk name.
> v1: indeed this is new added in v2 patch set.
>
> Signed-off-by: Coly Li
> Reviewed-by: Hannes Reinecke
> Cc: Michael Lyle
> Cc: Junhui Tang
> ---
>  drivers/md/bcache/bcache.h  |  6 ++
>  drivers/md/bcache/io.c      | 14 ++
>  drivers/md/bcache/request.c | 14 --
>  drivers/md/bcache/super.c   | 21 +
>  drivers/md/bcache/sysfs.c   | 15 ++-
>  5 files changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 5e9f3610c6fd..d338b7086013 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -367,6 +367,7 @@ struct cached_dev {
>  	unsigned		sequential_cutoff;
>  	unsigned		readahead;
>
> +	unsigned		io_disable:1;
>  	unsigned		verify:1;
>  	unsigned		bypass_torture_test:1;
>
> @@ -388,6 +389,9 @@ struct cached_dev {
>  	unsigned		writeback_rate_minimum;
>
>  	enum stop_on_failure	stop_when_cache_set_failed;
> +#define DEFAULT_CACHED_DEV_ERROR_LIMIT	64
> +	atomic_t		io_errors;
> +	unsigned		error_limit;
>  };
>
>  enum alloc_reserve {
> @@ -911,6 +915,7 @@ static inline void wait_for_kthread_stop(void)
>
>  /* Forward declarations */
>
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio);
>  void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
>  void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
>  			      blk_status_t, const char *);
> @@ -938,6 +943,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned,
>  			 struct bkey *, int, bool);
>  bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned,
>  		       unsigned, unsigned, bool);
> +bool bch_cached_dev_error(struct cached_dev *dc);
>
>  __printf(2, 3)
>  bool bch_cache_set_error(struct cache_set *, const char *, ...);
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index 8013ecbcdbda..7fac97ae036e 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
>  }
>
>  /* IO errors */
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
> +{
> +	char buf[BDEVNAME_SIZE];
> +	unsigned errors;
> +
> +	WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
> +
> +	errors = atomic_add_return(1, &dc->io_errors);
> +	if (errors < dc->error_limit)
> +		pr_err("%s: IO error on backing device, unrecoverable",
> +			bio_devname(bio, buf));
> +	else
> +		bch_cached_dev_error(dc);
> +}
>
>  void bch_count_io_errors(struct cache *ca,
>  			 blk_status_t error,
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 0c517dd806a5..d7a463e0250e 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio)
>
>  	if (bio->bi_status) {
>  		struct search *s = container_of(cl, struct search, cl);
> +		struct cached_dev *dc = container_of(s->d,
> +						     struct cached_dev, disk);
>  		/*
>  		 * If a bio has REQ_PREFLUSH for writeback mode, it is
>  		 * speically assembled in cached_dev_write() for a non-zero
> @@ -657,6 +659,7 @@ static void backing_request_endio(struct bio *bio)
>  		}
>  		s->recoverable = false;
Re: [PATCH v7 7/9] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O
LGTM, applying On 02/27/2018 08:55 AM, Coly Li wrote: > In order to catch I/O error of backing device, a separate bi_end_io > call back is required. Then a per backing device counter can record I/O > errors number and retire the backing device if the counter reaches a > per backing device I/O error limit. > > This patch adds backing_request_endio() to bcache backing device I/O code > path, this is a preparation for further complicated backing device failure > handling. So far there is no real code logic change, I make this change a > separate patch to make sure it is stable and reliable for further work. > > Changelog: > v2: Fix code comments typo, remove a redundant bch_writeback_add() line > added in v4 patch set. > v1: indeed this is new added in this patch set. > > Signed-off-by: Coly Li > Reviewed-by: Hannes Reinecke > Cc: Junhui Tang > Cc: Michael Lyle > --- > drivers/md/bcache/request.c | 93 > +++ > drivers/md/bcache/super.c | 1 + > drivers/md/bcache/writeback.c | 1 + > 3 files changed, 79 insertions(+), 16 deletions(-) > > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c > index 279c9266bf50..0c517dd806a5 100644 > --- a/drivers/md/bcache/request.c > +++ b/drivers/md/bcache/request.c > @@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl) > } > > op->insert_data_done = true; > + /* get in bch_data_insert() */ > bio_put(bio); > out: > continue_at(cl, bch_data_insert_keys, op->wq); > @@ -630,6 +631,38 @@ static void request_endio(struct bio *bio) > closure_put(cl); > } > > +static void backing_request_endio(struct bio *bio) > +{ > + struct closure *cl = bio->bi_private; > + > + if (bio->bi_status) { > + struct search *s = container_of(cl, struct search, cl); > + /* > + * If a bio has REQ_PREFLUSH for writeback mode, it is > + * speically assembled in cached_dev_write() for a non-zero > + * write request which has REQ_PREFLUSH. 
we don't set > + * s->iop.status by this failure, the status will be decided > + * by result of bch_data_insert() operation. > + */ > + if (unlikely(s->iop.writeback && > + bio->bi_opf & REQ_PREFLUSH)) { > + char buf[BDEVNAME_SIZE]; > + > + bio_devname(bio, buf); > + pr_err("Can't flush %s: returned bi_status %i", > + buf, bio->bi_status); > + } else { > + /* set to orig_bio->bi_status in bio_complete() */ > + s->iop.status = bio->bi_status; > + } > + s->recoverable = false; > + /* should count I/O error for backing device here */ > + } > + > + bio_put(bio); > + closure_put(cl); > +} > + > static void bio_complete(struct search *s) > { > if (s->orig_bio) { > @@ -644,13 +677,21 @@ static void bio_complete(struct search *s) > } > } > > -static void do_bio_hook(struct search *s, struct bio *orig_bio) > +static void do_bio_hook(struct search *s, > + struct bio *orig_bio, > + bio_end_io_t *end_io_fn) > { > struct bio *bio = &s->bio.bio; > > bio_init(bio, NULL, 0); > __bio_clone_fast(bio, orig_bio); > - bio->bi_end_io = request_endio; > + /* > + * bi_end_io can be set separately somewhere else, e.g. 
the > + * variants in, > + * - cache_bio->bi_end_io from cached_dev_cache_miss() > + * - n->bi_end_io from cache_lookup_fn() > + */ > + bio->bi_end_io = end_io_fn; > bio->bi_private = &s->cl; > > bio_cnt_set(bio, 3); > @@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio, > s = mempool_alloc(d->c->search, GFP_NOIO); > > closure_init(&s->cl, NULL); > - do_bio_hook(s, bio); > + do_bio_hook(s, bio, request_endio); > > s->orig_bio = bio; > s->cache_miss = NULL; > @@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl) > trace_bcache_read_retry(s->orig_bio); > > s->iop.status = 0; > - do_bio_hook(s, s->orig_bio); > + do_bio_hook(s, s->orig_bio, backing_request_endio); > > /* XXX: invalidate cache */ > > + /* I/O request sent to backing device */ > closure_bio_submit(s->iop.c, bio, cl); > } > > @@ -859,7 +901,7 @@ static int cached_dev_cache_miss(struct btree *b, struct > search *s, > bio_copy_dev(cache_bio, miss); > cache_bio->bi_iter.bi_size = s->insert_bio_sectors << 9; > > - cache_bio->bi_end_io= request_endio; > + cache_bio->bi_end_io= backing_request_endio; > cache_bio->bi_private = &s->cl; > > bch_bio_map(cac
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On 14/03/18 06:16 AM, David Laight wrote: > That surprises me (unless I missed something last time I read the spec). > While P2P writes are relatively easy to handle, reads and any other TLP that > require acks are a completely different proposition. > There are no additional fields that can be set in the read TLP and will be > reflected back in the ack(s) that can be used to route the acks back to the > correct initiator. > > I'm pretty sure that to support P2P reads a switch would have to save > the received read TLP and (possibly later on) issue read TLP of its own > for the required data. > I'm not even sure it is easy to interleave the P2P reads with those > coming from the root. > That requires a potentially infinite queue of pending requests. This is wrong. A completion is a TLP just like any other and makes use of the Destination ID field in the header to route it back to the original requester. > Some x86 root ports support P2P writes (maybe with a bios option). > It would be a shame not to be able to do P2P writes on such systems > even though P2P reads won't work. Yes, and this has been discussed many times. It won't be changing in the near term. Logan
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On 13/03/18 08:56 PM, Bjorn Helgaas wrote: > I assume you want to exclude Root Ports because of multi-function > devices and the "route to self" error. I was hoping for a reference > to that so I could learn more about it. I haven't been able to find where in the spec it forbids route to self. But I was told this by developers who work on switches. Hopefully Stephen can find the reference. But it's a bit of a moot point. Devices can DMA to themselves if they are designed to do so. For example, some NVMe cards can read and write their own CMB for certain types of DMA request. There is a register in the spec (CMBSZ) which specifies which types of requests are supported. (See 3.1.12 in NVMe 1.3a). > I agree that peers need to have a common upstream bridge. I think > you're saying peers need to have *two* common upstream bridges. If I > understand correctly, requiring two common bridges is a way to ensure > that peers directly below Root Ports don't try to DMA to each other. No, I don't get where you think we need to have two common upstream bridges. I'm not sure when such a case would ever happen. But you seem to understand based on what you wrote below. > So I guess the first order of business is to nail down whether peers > below a Root Port are prohibited from DMAing to each other. My > assumption, based on 6.12.1.2 and the fact that I haven't yet found > a prohibition, is that they can. If you have a multifunction device designed to DMA to itself below a root port, it can. But determining this is on a device by device basis, just as determining whether a root complex can do peer to peer is on a per device basis. So I'd say we don't want to allow it by default and let someone who has such a device figure out what's necessary if and when one comes along. > You already have upstream_bridges_match(), which takes two pci_devs.
> I think it should walk up the PCI hierarchy from the first device, > checking whether the bridge at each level is also a parent of the > second device. Yes, this is what I meant when I said walking the entire tree. I've been kicking the can down the road on implementing this as getting ref counting right and testing it is going to be quite tricky. The single switch approach we implemented now is just a simplification which works for a single switch. But I guess we can look at implementing it this way for v4. Logan
[PATCH v3] block: bio_check_eod() needs to consider partitions
bio_check_eod() should check partiton size not the whole disk if bio->bi_partno is non-zero. Does this by taking the call to bio_check_eod into blk_partition_remap. Based on an earlier patch from Jiufei Xue. Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index") Reported-by: Jiufei Xue Signed-off-by: Christoph Hellwig --- block/blk-core.c | 93 1 file changed, 40 insertions(+), 53 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6d82c4f7fadd..47ee24611126 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2023,7 +2023,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) return BLK_QC_T_NONE; } -static void handle_bad_sector(struct bio *bio) +static void handle_bad_sector(struct bio *bio, sector_t maxsector) { char b[BDEVNAME_SIZE]; @@ -2031,7 +2031,7 @@ static void handle_bad_sector(struct bio *bio) printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n", bio_devname(bio, b), bio->bi_opf, (unsigned long long)bio_end_sector(bio), - (long long)get_capacity(bio->bi_disk)); + (long long)maxsector); } #ifdef CONFIG_FAIL_MAKE_REQUEST @@ -2092,68 +2092,59 @@ static noinline int should_fail_bio(struct bio *bio) } ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO); +/* + * Check whether this bio extends beyond the end of the device or partition. + * This may well happen - the kernel calls bread() without checking the size of + * the device, e.g., when mounting a file system. + */ +static inline int bio_check_eod(struct bio *bio, sector_t maxsector) +{ + unsigned int nr_sectors = bio_sectors(bio); + + if (nr_sectors && maxsector && + (nr_sectors > maxsector || +bio->bi_iter.bi_sector > maxsector - nr_sectors)) { + handle_bad_sector(bio, maxsector); + return -EIO; + } + return 0; +} + /* * Remap block n of partition p to block n+start(p) of the disk. 
*/ static inline int blk_partition_remap(struct bio *bio) { struct hd_struct *p; - int ret = 0; + int ret = -EIO; rcu_read_lock(); p = __disk_get_part(bio->bi_disk, bio->bi_partno); - if (unlikely(!p || should_fail_request(p, bio->bi_iter.bi_size) || -bio_check_ro(bio, p))) { - ret = -EIO; + if (unlikely(!p)) + goto out; + if (unlikely(should_fail_request(p, bio->bi_iter.bi_size))) + goto out; + if (unlikely(bio_check_ro(bio, p))) goto out; - } /* * Zone reset does not include bi_size so bio_sectors() is always 0. * Include a test for the reset op code and perform the remap if needed. */ - if (!bio_sectors(bio) && bio_op(bio) != REQ_OP_ZONE_RESET) - goto out; - - bio->bi_iter.bi_sector += p->start_sect; - bio->bi_partno = 0; - trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p), - bio->bi_iter.bi_sector - p->start_sect); - + if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) { + if (bio_check_eod(bio, part_nr_sects_read(p))) + goto out; + bio->bi_iter.bi_sector += p->start_sect; + bio->bi_partno = 0; + trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p), + bio->bi_iter.bi_sector - p->start_sect); + } + ret = 0; out: rcu_read_unlock(); return ret; } -/* - * Check whether this bio extends beyond the end of the device. - */ -static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors) -{ - sector_t maxsector; - - if (!nr_sectors) - return 0; - - /* Test device or partition size, when known. */ - maxsector = get_capacity(bio->bi_disk); - if (maxsector) { - sector_t sector = bio->bi_iter.bi_sector; - - if (maxsector < nr_sectors || maxsector - nr_sectors < sector) { - /* -* This may well happen - the kernel calls bread() -* without checking the size of the device, e.g., when -* mounting a device. 
-*/ - handle_bad_sector(bio); - return 1; - } - } - - return 0; -} - static noinline_for_stack bool generic_make_request_checks(struct bio *bio) { @@ -2164,9 +2155,6 @@ generic_make_request_checks(struct bio *bio) might_sleep(); - if (bio_check_eod(bio, nr_sectors)) - goto end_io; - q = bio->bi_disk->queue; if (unlikely(!q)) { printk(KERN_ERR @@ -2186,17 +2174,16 @@ generic_make_request
Re: [PATCH] dm mpath: fix passing integrity data
Steffen, > After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity > data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support > block integrity any more. So add it to the whitelist. Ugh. Reviewed-by: Martin K. Petersen -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH] dm mpath: fix passing integrity data
On 03/14/2018 03:33 PM, Steffen Maier wrote: > After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity > data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support > block integrity any more. So add it to the whitelist. > > This is also a pre-requisite to use block integrity with other dm layer(s) > on top of multipath, such as kpartx partitions (dm-linear) or LVM. > > Signed-off-by: Steffen Maier > Bisected-by: Fedor Loshakov > Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data") > Cc: #4.12+ > --- > drivers/md/dm-mpath.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c > index 3fde9e9faddd..c174f0c53dc9 100644 > --- a/drivers/md/dm-mpath.c > +++ b/drivers/md/dm-mpath.c > @@ -2023,7 +2023,8 @@ static int multipath_busy(struct dm_target *ti) > static struct target_type multipath_target = { > .name = "multipath", > .version = {1, 12, 0}, > - .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE, > + .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE | > + DM_TARGET_PASSES_INTEGRITY, > .module = THIS_MODULE, > .ctr = multipath_ctr, > .dtr = multipath_dtr, > Ho-hum. Thanks for this. Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [PATCH v2] block: bio_check_eod() needs to consider partition
On Wed, 2018-03-14 at 14:03 +0100, h...@lst.de wrote: > can you test the version below? Hello Christoph, The same VM that failed to boot with v2 of this patch boots fine with this patch. Thanks, Bart.
Re: [PATCH V5 1/5] scsi: hpsa: fix selection of reply queue
On Tue, 2018-03-13 at 17:42 +0800, Ming Lei wrote: > From 84676c1f21 (genirq/affinity: assign vectors to all possible CPUs), > one msix vector can be created without any online CPU mapped, then one > command's completion may not be notified. > > This patch sets up the mapping between cpu and reply queue according to irq > affinity info retrieved by pci_irq_get_affinity(), and uses this mapping > table to choose the reply queue for queuing one command. > > Then the chosen reply queue has to be active, which fixes the IO hang caused > by using an inactive reply queue which doesn't have any online CPU mapped. > > Cc: Hannes Reinecke > Cc: "Martin K. Petersen" , > Cc: James Bottomley , > Cc: Christoph Hellwig , > Cc: Don Brace > Cc: Kashyap Desai > Cc: Laurence Oberman > Cc: Meelis Roos > Cc: Artem Bityutskiy > Cc: Mike Snitzer > Tested-by: Laurence Oberman > Tested-by: Don Brace > Tested-by: Artem Bityutskiy > Acked-by: Don Brace > Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs") > Signed-off-by: Ming Lei Checked v5 on my Skylake Xeon and with this patch the regression that I reported is fixed. Tested-by: Artem Bityutskiy Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com
Re: [PATCH V5 2/5] scsi: megaraid_sas: fix selection of reply queue
On Tue, 2018-03-13 at 17:42 +0800, Ming Lei wrote: > From 84676c1f21 (genirq/affinity: assign vectors to all possible CPUs), > one msix vector can be created without any online CPU mapped, then a > command may be queued, and won't be notified after its completion. > > This patch sets up the mapping between cpu and reply queue according to irq > affinity info retrieved by pci_irq_get_affinity(), and uses this info > to choose the reply queue for queuing one command. > > Then the chosen reply queue has to be active, which fixes the IO hang caused > by using an inactive reply queue which doesn't have any online CPU mapped. > > Cc: Hannes Reinecke > Cc: "Martin K. Petersen" , > Cc: James Bottomley , > Cc: Christoph Hellwig , > Cc: Don Brace > Cc: Kashyap Desai > Cc: Laurence Oberman > Cc: Mike Snitzer > Cc: Meelis Roos > Cc: Artem Bityutskiy > Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs") > Signed-off-by: Ming Lei Checked v5 on my Skylake Xeon and with this patch the regression that I reported is fixed. Tested-by: Artem Bityutskiy Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com
[PATCH 11/16] treewide: simplify Kconfig dependencies for removed archs
A lot of Kconfig symbols have architecture specific dependencies. In those cases that depend on architectures we have already removed, they can be omitted. Signed-off-by: Arnd Bergmann --- block/bounce.c | 2 +- drivers/ide/Kconfig | 2 +- drivers/ide/ide-generic.c| 12 +--- drivers/input/joystick/analog.c | 2 +- drivers/isdn/hisax/Kconfig | 10 +- drivers/net/ethernet/davicom/Kconfig | 2 +- drivers/net/ethernet/smsc/Kconfig| 6 +++--- drivers/net/wireless/cisco/Kconfig | 2 +- drivers/pwm/Kconfig | 2 +- drivers/rtc/Kconfig | 2 +- drivers/spi/Kconfig | 4 ++-- drivers/usb/musb/Kconfig | 2 +- drivers/video/console/Kconfig| 3 +-- drivers/watchdog/Kconfig | 6 -- drivers/watchdog/Makefile| 6 -- fs/Kconfig.binfmt| 5 ++--- fs/minix/Kconfig | 2 +- include/linux/ide.h | 7 +-- init/Kconfig | 5 ++--- lib/Kconfig.debug| 13 + lib/test_user_copy.c | 2 -- mm/Kconfig | 7 --- mm/percpu.c | 4 23 files changed, 31 insertions(+), 77 deletions(-) diff --git a/block/bounce.c b/block/bounce.c index 6a3e68292273..dd0b93f2a871 100644 --- a/block/bounce.c +++ b/block/bounce.c @@ -31,7 +31,7 @@ static struct bio_set *bounce_bio_set, *bounce_bio_split; static mempool_t *page_pool, *isa_page_pool; -#if defined(CONFIG_HIGHMEM) || defined(CONFIG_NEED_BOUNCE_POOL) +#if defined(CONFIG_HIGHMEM) static __init int init_emergency_pool(void) { #if defined(CONFIG_HIGHMEM) && !defined(CONFIG_MEMORY_HOTPLUG) diff --git a/drivers/ide/Kconfig b/drivers/ide/Kconfig index cf1fb3fb5d26..901b8833847f 100644 --- a/drivers/ide/Kconfig +++ b/drivers/ide/Kconfig @@ -200,7 +200,7 @@ comment "IDE chipset support/bugfixes" config IDE_GENERIC tristate "generic/default IDE chipset support" - depends on ALPHA || X86 || IA64 || M32R || MIPS || ARCH_RPC + depends on ALPHA || X86 || IA64 || MIPS || ARCH_RPC default ARM && ARCH_RPC help This is the generic IDE driver. 
This driver attaches to the diff --git a/drivers/ide/ide-generic.c b/drivers/ide/ide-generic.c index 54d7c4685d23..80c0d69b83ac 100644 --- a/drivers/ide/ide-generic.c +++ b/drivers/ide/ide-generic.c @@ -13,13 +13,10 @@ #include #include -/* FIXME: convert arm and m32r to use ide_platform host driver */ +/* FIXME: convert arm to use ide_platform host driver */ #ifdef CONFIG_ARM #include #endif -#ifdef CONFIG_M32R -#include -#endif #define DRV_NAME "ide_generic" @@ -35,13 +32,6 @@ static const struct ide_port_info ide_generic_port_info = { #ifdef CONFIG_ARM static const u16 legacy_bases[] = { 0x1f0 }; static const int legacy_irqs[] = { IRQ_HARDDISK }; -#elif defined(CONFIG_PLAT_M32700UT) || defined(CONFIG_PLAT_MAPPI2) || \ - defined(CONFIG_PLAT_OPSPUT) -static const u16 legacy_bases[] = { 0x1f0 }; -static const int legacy_irqs[] = { PLD_IRQ_CFIREQ }; -#elif defined(CONFIG_PLAT_MAPPI3) -static const u16 legacy_bases[] = { 0x1f0, 0x170 }; -static const int legacy_irqs[] = { PLD_IRQ_CFIREQ, PLD_IRQ_IDEIREQ }; #elif defined(CONFIG_ALPHA) static const u16 legacy_bases[] = { 0x1f0, 0x170, 0x1e8, 0x168 }; static const int legacy_irqs[] = { 14, 15, 11, 10 }; diff --git a/drivers/input/joystick/analog.c b/drivers/input/joystick/analog.c index be1b4921f22a..eefac7978f93 100644 --- a/drivers/input/joystick/analog.c +++ b/drivers/input/joystick/analog.c @@ -163,7 +163,7 @@ static unsigned int get_time_pit(void) #define GET_TIME(x)do { x = (unsigned int)rdtsc(); } while (0) #define DELTA(x,y) ((y)-(x)) #define TIME_NAME "TSC" -#elif defined(__alpha__) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || defined(CONFIG_RISCV) || defined(CONFIG_TILE) +#elif defined(__alpha__) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || defined(CONFIG_RISCV) #define GET_TIME(x)do { x = get_cycles(); } while (0) #define DELTA(x,y) ((y)-(x)) #define TIME_NAME "get_cycles" diff --git a/drivers/isdn/hisax/Kconfig b/drivers/isdn/hisax/Kconfig index eb83d94ab4fe..38cfc8baae19 100644 --- 
a/drivers/isdn/hisax/Kconfig +++ b/drivers/isdn/hisax/Kconfig @@ -109,7 +109,7 @@ config HISAX_16_3 config HISAX_TELESPCI bool "Teles PCI" - depends on PCI && (BROKEN || !(SPARC || PPC || PARISC || M68K || (MIPS && !CPU_LITTLE_ENDIAN) || FRV || (XTENSA && !CPU_LITTLE_ENDIAN))) + depends on PCI && (BROKEN || !(SPARC || PPC || PARISC || M68K || (MIPS && !CPU_LITTLE_ENDIAN) || (XTENSA && !CPU_LITTLE_ENDIAN))) help This enables HiSax support for the Teles PCI. See on how to configure it. @@ -237,7 +237,7 @@ config HISAX_MIC config HISAX_NETJET bool "NET
[PATCH 00/16] remove eight obsolete architectures
Here is the collection of patches I have applied to my 'asm-generic' tree on top of the 'metag' removal. This does not include any of the device drivers, I'll send those separately to a somewhat different list of people. The removal came out of a discussion that is now documented at https://lwn.net/Articles/748074/ Following up from the state described there, I ended up removing the mn10300, tile, blackfin and cris architectures directly, rather than waiting, after consulting with the respective maintainers. However, the unicore32 architecture is no longer part of the removal, after its maintainer Xuetao Guan said that the port is still actively being used and that he intends to keep working on it, and that he will try to provide updated toolchain sources. In the end, it seems that while the eight architectures are extremely different, they all suffered the same fate: There was one company in charge of an SoC line, a CPU microarchitecture and a software ecosystem, which was more costly than licensing newer off-the-shelf CPU cores from a third party (typically ARM, MIPS, or RISC-V). It seems that all the SoC product lines are still around, but have not used the custom CPU architectures for several years at this point.
Arnd Arnd Bergmann (14): arch: remove frv port arch: remove m32r port arch: remove score port arch: remove blackfin port arch: remove tile port procfs: remove CONFIG_HARDWALL dependency mm: remove blackfin MPU support mm: remove obsolete alloc_remap() treewide: simplify Kconfig dependencies for removed archs asm-generic: siginfo: remove obsolete #ifdefs Documentation: arch-support: remove obsolete architectures asm-generic: clean up asm/unistd.h recordmcount.pl: drop blackfin and tile support ktest: remove obsolete architectures David Howells (1): mn10300: Remove the architecture Jesper Nilsson (1): CRIS: Drop support for the CRIS port Dirstat only (full diffstat is over 100KB): 6.3% arch/blackfin/mach-bf548/include/mach/ 4.5% arch/blackfin/mach-bf609/include/mach/ 26.3% arch/blackfin/ 4.1% arch/cris/arch-v32/ 5.6% arch/cris/include/arch-v32/arch/hwregs/iop/ 4.1% arch/cris/include/arch-v32/mach-a3/mach/hwregs/ 4.7% arch/cris/include/arch-v32/ 7.8% arch/cris/ 5.6% arch/frv/ 5.5% arch/m32r/ 7.0% arch/mn10300/ 7.6% arch/tile/include/ 6.4% arch/tile/kernel/ 0.0% Documentation/admin-guide/ 0.0% Documentation/blackfin/ 0.0% Documentation/cris/ 0.0% Documentation/devicetree/bindings/cris/ 0.0% Documentation/devicetree/bindings/interrupt-controller/ 2.8% Documentation/features/ 0.5% Documentation/frv/ 0.0% Documentation/ioctl/ 0.0% Documentation/mn10300/ 0.0% Documentation/ 0.0% block/ 0.0% crypto/ 0.0% drivers/ide/ 0.0% drivers/input/joystick/ 0.0% drivers/isdn/hisax/ 0.0% drivers/net/ethernet/davicom/ 0.0% drivers/net/ethernet/smsc/ 0.0% drivers/net/wireless/cisco/ 0.0% drivers/pci/ 0.0% drivers/pwm/ 0.0% drivers/rtc/ 0.0% drivers/spi/ 0.0% drivers/staging/speakup/ 0.0% drivers/usb/musb/ 0.0% drivers/video/console/ 0.0% drivers/watchdog/ 0.0% fs/minix/ 0.0% fs/proc/ 0.0% fs/ 0.0% include/asm-generic/ 0.0% include/linux/ 0.0% include/uapi/asm-generic/ 0.0% init/ 0.0% kernel/ 0.0% lib/ 0.0% mm/ 0.0% samples/blackfin/ 0.0% samples/kprobes/ 0.0% samples/ 0.0% scripts/mod/
0.0% scripts/ 0.0% tools/arch/frv/include/uapi/asm/ 0.0% tools/arch/m32r/include/uapi/asm/ 0.0% tools/arch/mn10300/include/uapi/asm/ 0.0% tools/arch/score/include/uapi/asm/ 0.0% tools/arch/tile/include/asm/ 0.0% tools/arch/tile/include/uapi/asm/ 0.0% tools/include/asm-generic/ 0.0% tools/scripts/ 0.0% tools/testing/ktest/examples/ 0.0% tools/testing/ktest/ Cc: linux-...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-block@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-in...@vger.kernel.org Cc: net...@vger.kernel.org Cc: linux-wirel...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linux-fb...@vger.kernel.org Cc: linux-watch...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux-a...@vger.kernel.org Cc: linux...@kvack.org
[PATCH] dm mpath: fix passing integrity data
After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support block integrity any more. So add it to the whitelist. This is also a pre-requisite to use block integrity with other dm layer(s) on top of multipath, such as kpartx partitions (dm-linear) or LVM. Signed-off-by: Steffen Maier Bisected-by: Fedor Loshakov Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data") Cc: #4.12+ --- drivers/md/dm-mpath.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index 3fde9e9faddd..c174f0c53dc9 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -2023,7 +2023,8 @@ static int multipath_busy(struct dm_target *ti) static struct target_type multipath_target = { .name = "multipath", .version = {1, 12, 0}, - .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE, + .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE | + DM_TARGET_PASSES_INTEGRITY, .module = THIS_MODULE, .ctr = multipath_ctr, .dtr = multipath_dtr, -- 2.13.5
Re: [PATCH v5] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()
Hello, On Wed, Mar 14, 2018 at 02:18:04PM +0800, Joseph Qi wrote: > Fixes: ae1188963611 ("blkcg: consolidate blkg creation in > blkcg_bio_issue_check()") > Reported-by: Jiufei Xue > Cc: sta...@vger.kernel.org #4.3+ I'm a bit nervous about tagging it for -stable. Given the low rate of this actually occurring, I'm not sure the benefits outweigh the risks. Let's at least cook it for a couple releases before sending it to -stable. > diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h > index 69bea82..dccd102 100644 > --- a/include/linux/blk-cgroup.h > +++ b/include/linux/blk-cgroup.h > @@ -88,6 +88,7 @@ struct blkg_policy_data { > /* the blkg and policy id this per-policy data belongs to */ > struct blkcg_gq *blkg; > int plid; > + boolofflined; > }; This is pure bike-shedding but offlined reads kinda weird to me, maybe just offline would read better? Other than that, Acked-by: Tejun Heo Thanks a lot for seeing this through. -- tejun
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
>I assume you want to exclude Root Ports because of multi-function > devices and the "route to self" error. I was hoping for a reference > to that so I could learn more about it. Apologies Bjorn. This slipped through my net. I will try and get you a reference for RTS in the next couple of days. > While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS > Functions in SR-IOV Capable and Multi-Function Devices", which seems > relevant. It talks about "peer-to-peer Requests (between Functions of > the device)". That says to me that multi-function devices can DMA > between themselves. I will go take a look. Appreciate the link. Stephen
Re: [PATCH v2] block: bio_check_eod() needs to consider partition
Hi Bart, can you test the version below? --- >From a68a8518158e31d66a0dc4f4e795ca3ceb83752c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 13 Mar 2018 09:27:30 +0100 Subject: block: bio_check_eod() needs to consider partition bio_check_eod() should check partiton size not the whole disk if bio->bi_partno is non-zero. Does this by taking the call to bio_check_eod into blk_partition_remap. Based on an earlier patch from Jiufei Xue. Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index") Reported-by: Jiufei Xue Signed-off-by: Christoph Hellwig --- block/blk-core.c | 93 1 file changed, 40 insertions(+), 53 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6d82c4f7fadd..47ee24611126 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2023,7 +2023,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) return BLK_QC_T_NONE; } -static void handle_bad_sector(struct bio *bio) +static void handle_bad_sector(struct bio *bio, sector_t maxsector) { char b[BDEVNAME_SIZE]; @@ -2031,7 +2031,7 @@ static void handle_bad_sector(struct bio *bio) printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n", bio_devname(bio, b), bio->bi_opf, (unsigned long long)bio_end_sector(bio), - (long long)get_capacity(bio->bi_disk)); + (long long)maxsector); } #ifdef CONFIG_FAIL_MAKE_REQUEST @@ -2092,68 +2092,59 @@ static noinline int should_fail_bio(struct bio *bio) } ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO); +/* + * Check whether this bio extends beyond the end of the device or partition. + * This may well happen - the kernel calls bread() without checking the size of + * the device, e.g., when mounting a file system. 
+ */ +static inline int bio_check_eod(struct bio *bio, sector_t maxsector) +{ + unsigned int nr_sectors = bio_sectors(bio); + + if (nr_sectors && maxsector && + (nr_sectors > maxsector || +bio->bi_iter.bi_sector > maxsector - nr_sectors)) { + handle_bad_sector(bio, maxsector); + return -EIO; + } + return 0; +} + /* * Remap block n of partition p to block n+start(p) of the disk. */ static inline int blk_partition_remap(struct bio *bio) { struct hd_struct *p; - int ret = 0; + int ret = -EIO; rcu_read_lock(); p = __disk_get_part(bio->bi_disk, bio->bi_partno); - if (unlikely(!p || should_fail_request(p, bio->bi_iter.bi_size) || -bio_check_ro(bio, p))) { - ret = -EIO; + if (unlikely(!p)) + goto out; + if (unlikely(should_fail_request(p, bio->bi_iter.bi_size))) + goto out; + if (unlikely(bio_check_ro(bio, p))) goto out; - } /* * Zone reset does not include bi_size so bio_sectors() is always 0. * Include a test for the reset op code and perform the remap if needed. */ - if (!bio_sectors(bio) && bio_op(bio) != REQ_OP_ZONE_RESET) - goto out; - - bio->bi_iter.bi_sector += p->start_sect; - bio->bi_partno = 0; - trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p), - bio->bi_iter.bi_sector - p->start_sect); - + if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) { + if (bio_check_eod(bio, part_nr_sects_read(p))) + goto out; + bio->bi_iter.bi_sector += p->start_sect; + bio->bi_partno = 0; + trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p), + bio->bi_iter.bi_sector - p->start_sect); + } + ret = 0; out: rcu_read_unlock(); return ret; } -/* - * Check whether this bio extends beyond the end of the device. - */ -static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors) -{ - sector_t maxsector; - - if (!nr_sectors) - return 0; - - /* Test device or partition size, when known. 
*/ - maxsector = get_capacity(bio->bi_disk); - if (maxsector) { - sector_t sector = bio->bi_iter.bi_sector; - - if (maxsector < nr_sectors || maxsector - nr_sectors < sector) { - /* -* This may well happen - the kernel calls bread() -* without checking the size of the device, e.g., when -* mounting a device. -*/ - handle_bad_sector(bio); - return 1; - } - } - - return 0; -} - static noinline_for_stack bool generic_make_request_checks(struct bio *bio) { @@ -2164,9 +2155,6 @@ generic_make_request_checks(struct bio *b
RE: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
From: Logan Gunthorpe > Sent: 13 March 2018 23:46 ... > As Stephen pointed out, it's a requirement of the PCIe spec that a > switch supports P2P. If you want to sell a switch that does P2P with bad > performance then that's on you to deal with. That surprises me (unless I missed something last time I read the spec). While P2P writes are relatively easy to handle, reads and any other TLP that require acks are a completely different proposition. There are no additional fields that can be set in the read TLP and will be reflected back in the ack(s) that can be used to route the acks back to the correct initiator. I'm pretty sure that to support P2P reads a switch would have to save the received read TLP and (possibly later on) issue read TLP of its own for the required data. I'm not even sure it is easy to interleave the P2P reads with those coming from the root. That requires a potentially infinite queue of pending requests. Some x86 root ports support P2P writes (maybe with a bios option). It would be a shame not to be able to do P2P writes on such systems even though P2P reads won't work. (We looked at using P2P transfers for some data, but in the end used a different scheme. For our use case P2P writes were enough. An alternative would be to access the same host memory buffer from two different devices - but there isn't an API that lets you do that.) David
Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible
Hi Artem, At 03/14/2018 05:07 PM, Artem Bityutskiy wrote: On Wed, 2018-03-14 at 12:11 +0800, Dou Liyang wrote: At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote: On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy Longer term, yeah, I agree. Kernel's notion of possible CPU count should be realistic. I did a patch for that, Artem, could you help me to test it? I didn't consider the nr_cpu_ids before. Please ignore the old patch and try the following RFC patch. Sure I can help with testing a patch, could we please: 1. Start a new thread for this 2. Include ACPI forum/folks OK, I will do that right now. Thanks, dou Thanks, Artem.
Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
On Tue 06-03-18 20:28:59, Tetsuo Handa wrote:
> Laura Abbott wrote:
> > On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > > > Hi,
> > > >
> > > > The Fedora arm-32 build VMs have a somewhat long standing problem
> > > > of hanging when running mkfs.ext4 with a bunch of processes stuck
> > > > in D state. This has been seen as far back as 4.13 but is still
> > > > present on 4.14:
> > > >
> > > [...]
> > > > This looks like everything is blocked on the writeback completing but
> > > > the writeback has been throttled. According to the infra team, this
> > > > problem is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > > > https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > > > quite match since this seems to be completely stuck. Any suggestions to
> > > > narrow the problem down?
> > >
> > > How much dirtyable memory does the system have? We do allow only lowmem
> > > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > > highmem_is_dirtyable?
> >
> > Setting highmem_is_dirtyable did fix the problem. The infrastructure
> > people seemed satisfied enough with this (and are happy to have the
> > machines back).
>
> That's good.
>
> > I'll see if they are willing to run a few more tests
> > to get some more state information.
>
> Well, I'm far from understanding what is happening in your case, but I'm
> interested in other threads which were trying to allocate memory. Therefore,
> I would appreciate it if they can take SysRq-m + SysRq-t rather than SysRq-w
> (as described at http://akari.osdn.jp/capturing-kernel-messages.html ).
>
> Code which assumes that kswapd can make progress can get stuck when kswapd
> is blocked somewhere. And wbt_wait() seems to change behavior based on
> current_is_kswapd(). If everyone is waiting for kswapd but kswapd cannot
> make progress, I worry that it leads to hangups like your case.

Tetsuo, could you stop this finally, pretty please? This is a well known
limitation of 32b architectures with more than 4G. The lowmem can only
handle 896MB of memory and that can be filled up with other kernel
allocations. Stalled writeback is _usually_ a result of only little
dirtyable memory which is left in the lowmem. We cannot simply allow
highmem to be dirtyable by default due to reasons explained in other
email. I can imagine that it is hard for you to grasp that not everything
is "silent hang during OOM" but there are other things going on in the VM.

-- 
Michal Hocko
SUSE Labs
Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
On Mon 05-03-18 13:04:24, Laura Abbott wrote:
> On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > > Hi,
> > >
> > > The Fedora arm-32 build VMs have a somewhat long standing problem
> > > of hanging when running mkfs.ext4 with a bunch of processes stuck
> > > in D state. This has been seen as far back as 4.13 but is still
> > > present on 4.14:
> > >
> > [...]
> > > This looks like everything is blocked on the writeback completing but
> > > the writeback has been throttled. According to the infra team, this
> > > problem is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > > https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > > quite match since this seems to be completely stuck. Any suggestions to
> > > narrow the problem down?
> >
> > How much dirtyable memory does the system have? We do allow only lowmem
> > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > highmem_is_dirtyable?
>
> Setting highmem_is_dirtyable did fix the problem. The infrastructure
> people seemed satisfied enough with this (and are happy to have the
> machines back). I'll see if they are willing to run a few more tests
> to get some more state information.

Please be aware that highmem_is_dirtyable is not for free. There are some
code paths which can only allocate from lowmem (e.g. block device AFAIR)
and those could fill up the whole lowmem without any throttling.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Wed, 2018-03-14 at 12:11 +0800, Dou Liyang wrote:
> At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote:
> > On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy
> > > Longer term, yeah, I agree. Kernel's notion of possible CPU count
> > > should be realistic.
>
> I did a patch for that. Artem, could you help me to test it?
>
> I didn't consider nr_cpu_ids before. Please ignore the old patch and
> try the following RFC patch.

Sure, I can help with testing a patch, but could we please:

1. Start a new thread for this
2. Include the ACPI forum/folks

Thanks,
Artem.
Re: [PATCH V5 5/5] scsi: virtio_scsi: unify scsi_host_template
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH V5 4/5] scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH V5 1/5] scsi: hpsa: fix selection of reply queue
I still don't like the code duplication, but I guess I can fix this up in one of the next merge windows myself.. Reviewed-by: Christoph Hellwig
Re: [PATCH V5 2/5] scsi: megaraid_sas: fix selection of reply queue
Same as for hpsa.. Reviewed-by: Christoph Hellwig