RE: [patch] Revert "block: remove artifical max_hw_sectors cap"

2015-07-29 Thread Elliott, Robert (Server Storage)

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Jeff Moyer
> Sent: Wednesday, July 29, 2015 11:53 AM
> To: Christoph Hellwig 
> Cc: Jens Axboe ; linux-kernel@vger.kernel.org;
> dmilb...@redhat.com

Adding linux-scsi...

> Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"
> 
> Christoph Hellwig  writes:
> 
> > On Mon, Jul 20, 2015 at 03:17:07PM -0400, Jeff Moyer wrote:
> >> For SAN storage, we've seen initial write and re-write performance drop
> >> 25-50% across all I/O sizes.  On locally attached storage, we've seen
> >> regressions of 40% for all I/O types, but only for I/O sizes larger
> than
> >> 1MB.
> >
> > Workload and hardware, please.  And only mainline numbers, not some old
> > hacked vendor kernel, please.
> 
> I've attached a simple fio config that reproduces the problem.  It just
> does sequential, O_DIRECT write I/O with I/O sizes of 1M, 2M and 4M.  So
> far I've tested it on an HP HSV400 and an IBM XIV SAN array connected
> via a qlogic adapter, a nearline SATA drive, and a WD Red (NAS) SATA disk
> connected via an intel ich9r sata controller.  The kernel I tested was
> 4.2.0-rc3, and the testing was done across 3 different hosts (just
> because I don't have all the hardware connected to a single box).  I did
> 10 runs using max_sectors_kb set to 1024, and 10 runs with it set to
> 32767.  Results compare the averages of those 10 runs.  In no cases did
> I see a performance gain.  In two cases, there is a performance hit.
> 
> In addition to my testing, our performance teams have seen performance
> regressions running iozone on fibre channel-attached HP MSA1000 storage,
> as well as on an SSD hidden behind a megaraid controller.  I was not
> able to get the exact details on the SSD.  iozone configurations can be
> provided, but I think I've nailed the underlying problem with this test
> case.
> 
> But, don't take my word for it.  Run the fio script on your own
> hardware.  All you have to do is echo a couple of values into
> /sys/block/sdX/queue/max_sectors_kb to test, no kernel rebuilding
> required.
> 
> In the tables below, concentrate on the BW/IOPS numbers under the WRITE
> column.  Negative numbers indicate that max_sectors_kb of 32767 shows a
> performance regression of the indicated percentage when compared with a
> setting of 1024.
> 
> Christoph, did you have some hardware where a higher max_sectors_kb
> improved performance?

I no longer have the performance numbers, but the old
default of 512 KiB was interfering with building large
writes that RAID controllers can treat as full stripe
writes (avoiding the need to read the old data and
parity).
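As a concrete illustration (the geometry here is hypothetical, not
taken from any particular controller), a RAID-5 volume with eight
data drives and a 128 KiB strip needs 1 MiB writes to cover a full
stripe, so a 512 KiB cap forces a read-modify-write on every write:

```c
#include <assert.h>

/* Hypothetical RAID geometry -- strip size and data drive count
 * are illustrative only.  A full stripe write must cover
 * strip_kib KiB on each data drive. */
static unsigned int full_stripe_bytes(unsigned int strip_kib,
                                      unsigned int data_drives)
{
        return strip_kib * 1024u * data_drives;
}
```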

With the SCSI LLD value bumped up, some other limits remain:
* the block layer BIO_MAX_PAGES value of 256 limits IOs
  to a maximum of 1 MiB (bio chaining affects this too)
* the SCSI LLD maximum number of scatter gather entries
  reported in /sys/block/sdNN/queue/max_segments and
  /sys/block/sdNN/queue/max_segment_size creates a
  limit based on how fragmented the data buffer is
  in virtual memory
* the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
  indicates the maximum transfer size for one command over
  the SCSI transport protocol supported by the drive itself
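Taken together, the effective per-command maximum is the smallest of
these stacked limits.  A rough sketch (all values hypothetical; the
real ones come from sysfs and the Block Limits VPD page):

```c
#include <assert.h>

static unsigned int min_u(unsigned int a, unsigned int b)
{
        return a < b ? a : b;
}

/* Effective per-command IO limit: the smallest of the stacked
 * limits.  Arguments are in bytes except max_sectors_kb (KiB). */
static unsigned int effective_max_io(unsigned int max_sectors_kb,
                                     unsigned int bio_max_bytes,
                                     unsigned int max_segments,
                                     unsigned int max_segment_size,
                                     unsigned int vpd_max_xfer_bytes)
{
        unsigned int limit = max_sectors_kb * 1024u;

        limit = min_u(limit, bio_max_bytes);
        limit = min_u(limit, max_segments * max_segment_size);
        limit = min_u(limit, vpd_max_xfer_bytes);
        return limit;
}
```

With max_sectors_kb raised to 32767, the 256-page bio limit
(1 MiB) typically becomes the binding constraint.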

The patch let 1 MiB IOs flow through the stack, which
is a better fit for modern strip sizes than 512 KiB.

Software using large IOs must be prepared for long 
latencies in exchange for the potential bandwidth gains,
and must use a low (but greater than 1) queue depth 
to keep the IOs flowing back-to-back.

Are you finding real software generating such IOs
but relying on the storage stack to break them up
for decent performance?

Your fio script is using the sync IO engine, which
means no queuing.  This forces a turnaround time
between IOs, preventing the device from looking
ahead to see what's next (for sequential IOs, it
would otherwise likely continue the data transfer
with minimal delay).
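A job file of the kind being discussed might look like this (a
sketch, not the actual attached script; the target device and size
are placeholders):

```ini
[global]
ioengine=sync     ; no queuing: one IO in flight at a time
direct=1          ; O_DIRECT, bypassing the page cache
rw=write          ; sequential writes
filename=/dev/sdX ; placeholder target device
size=4g

[bs-1m]
bs=1M             ; repeat with bs=2M and bs=4M per the description
```

Run it once with /sys/block/sdX/queue/max_sectors_kb set to 1024 and
once set to 32767, then compare the WRITE bandwidth numbers.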

If the storage stack breaks up large sync IOs, the 
drive might be better at detecting that the access
pattern is sequential (e.g., the gaps are between 
every set of 2 IOs rather than every IO).  This is
very drive-specific.

If we have to go back to that artificial limit, then
modern drivers (e.g., blk-mq capable drivers) need a
way to raise the default; relying on users to change
the sysfs settings means they're usually not changed.

---
Robert Elliott, HP Server Storage


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] target: add support for START_STOP_UNIT SCSI opcode

2015-07-23 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Spencer Baugh
> Sent: Thursday, July 23, 2015 5:28 PM
> To: Christoph Hellwig ; Spencer Baugh
> 
...
> Subject: Re: [PATCH] target: add support for START_STOP_UNIT SCSI opcode
> 
> From: Brian Bunker 
> 
> AIX servers using VIOS servers that virtualize FC cards will have a
> problem booting without support for START_STOP_UNIT.
> 
> v2: Cite sb3r36 exactly, clean up if conditions
> 
> Signed-off-by: Brian Bunker 
> Signed-off-by: Spencer Baugh 
...
> v2: Cite sb3r36 exactly, clean up if conditions
> 
...
> diff --git a/drivers/target/target_core_sbc.c
> b/drivers/target/target_core_sbc.c
> index e318ddb..85c3c0a 100644
> --- a/drivers/target/target_core_sbc.c
> +++ b/drivers/target/target_core_sbc.c
> @@ -154,6 +154,38 @@ sbc_emulate_readcapacity_16(struct se_cmd *cmd)
>   return 0;
>  }
> 
> +static sense_reason_t
> +sbc_emulate_startstop(struct se_cmd *cmd)
> +{
> + unsigned char *cdb = cmd->t_task_cdb;
> +
> + /*
> +  * See sb3r36 section 5.25

Global typo in this patch - sb3 should be sbc3.

Editorial comment:
Note that the officially published versions of the ISO and ANSI 
standards don't carry that revision number r36; they just have
the standard name and year.  SBC-3 revision 36 became 
"ANSI INCITS 514-2014 Information technology - SCSI Block 
Commands - 3 (SBC-3)".

T10 isn't really obligated to keep making particular working 
drafts available, although the ones that have been assigned
version descriptors (in SPC-n) are more likely to stick 
around.  For SBC-3, only revisions 35 and 36 earned those.

Section numbers are quite volatile, too, so you might be
better off including the section name.  It's starting to
become standardese, but this kind of wording might be better:

"See the SBC-3 START STOP UNIT command description
(e.g., sbc3r36 5.25)"

> + /* From SBC-3:
> +  * Immediate bit should be set since there is nothing to complete
> +  * POWER CONDITION MODIFIER 0h
> +  */
> + if (!(cdb[1] & 1) || (cdb[2] | cdb[3]))
> + return TCM_INVALID_CDB_FIELD;

Technical comment:
The application client is not obligated to set the IMMED bit here.
In fact, that's an unusual choice.  Using the IMMED bit means the 
application client must handle the initial status for the 
command, then poll for the functional results with REQUEST SENSE
and TEST UNIT READY commands, which is much more complicated.
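A sketch of a more permissive check (hypothetical code, not the
proposed patch) that accepts either IMMED setting while still
rejecting a nonzero reserved byte or POWER CONDITION MODIFIER:

```c
#include <assert.h>
#include <stdbool.h>

/* START STOP UNIT CDB sanity check: IMMED (byte 1, bit 0) may be
 * either 0 or 1 -- do not require it.  Byte 2 is reserved and the
 * POWER CONDITION MODIFIER (byte 3) must be 0h for the emulated
 * behavior, so reject the CDB only if either is nonzero. */
static bool startstop_cdb_valid(const unsigned char *cdb)
{
        return (cdb[2] | cdb[3]) == 0;
}
```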




RE: [PATCH] x86/mm/pat: Do a small optimization when dump PAT memtype list

2015-07-23 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Pan Xinhui
> Sent: Thursday, July 23, 2015 4:54 AM
> To: linux-kernel@vger.kernel.org
> Subject: [PATCH] x86/mm/pat: Do a small optimization when dump PAT memtype
> list
...
> diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
> index 268b2c8..6302119 100644
> --- a/arch/x86/mm/pat.c
> +++ b/arch/x86/mm/pat.c
> @@ -1001,45 +1001,42 @@ EXPORT_SYMBOL_GPL(pgprot_writethrough);
> 
>  #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_X86_PAT)
> 
> -static struct memtype *memtype_get_idx(loff_t pos)
> +static struct memtype *memtype_get_idx(struct memtype *entry, loff_t pos)
>  {
> - struct memtype *print_entry;
>   int ret;
> 
> - print_entry  = kzalloc(sizeof(struct memtype), GFP_KERNEL);
> - if (!print_entry)
> - return NULL;
> -
>   spin_lock(&memtype_lock);
> - ret = rbt_memtype_copy_nth_element(print_entry, pos);
> + ret = rbt_memtype_copy_nth_element(entry, pos);
>   spin_unlock(&memtype_lock);
> 
> - if (!ret) {
> - return print_entry;
> - } else {
> - kfree(print_entry);
> - return NULL;
> - }
> + return ret ? NULL : entry;
>  }
> 
...
>  static void memtype_seq_stop(struct seq_file *seq, void *v)
>  {
> + kfree(seq->private);
>  }
> 

Consider adding 
seq->private = NULL; 
so the stale pointer isn't left around.  There's probably not
much risk of accessing it, but NULL is safer in case it is.
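In userspace terms, the pattern is (free() standing in for kfree(),
and the double pointer standing in for seq->private):

```c
#include <assert.h>
#include <stdlib.h>

/* Free the buffer and clear the stale pointer, so any later
 * accidental use faults on NULL instead of dereferencing a
 * dangling pointer to freed memory. */
static void stop_free(void **priv)
{
        free(*priv);
        *priv = NULL;
}
```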

---
Robert Elliott, HP Server Storage

RE: [PATCH] target: Drop iSCSI use of mutex around max_cmd_sn increment

2015-07-22 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Spencer Baugh
> Sent: Wednesday, July 22, 2015 5:08 PM
> Subject: [PATCH] target: Drop iSCSI use of mutex around max_cmd_sn
> increment
...
> diff --git a/drivers/target/iscsi/iscsi_target_configfs.c
> b/drivers/target/iscsi/iscsi_target_configfs.c
> index c1898c8..29d5930 100644
> --- a/drivers/target/iscsi/iscsi_target_configfs.c
> +++ b/drivers/target/iscsi/iscsi_target_configfs.c
> @@ -706,8 +706,8 @@ static ssize_t lio_target_nacl_show_info(
>   rb += sprintf(page+rb, " 0x%08x   0x%08x   0x%08x   0x%08x"
>   "   0x%08x   0x%08x\n",
>   sess->cmdsn_window,
> - (sess->max_cmd_sn - sess->exp_cmd_sn) + 1,
> - sess->exp_cmd_sn, sess->max_cmd_sn,
> + ((u32) atomic_read(&sess->max_cmd_sn) - sess-
> >exp_cmd_sn) + 1,
> + sess->exp_cmd_sn, (u32) atomic_read(&sess->max_cmd_sn),
>   sess->init_task_tag, sess->targ_xfer_tag);
>   rb += sprintf(page+rb, "--[iSCSI"
>   " Connections]-\n");

Two calls to atomic_read could pick up different values; 
calling it once and using the same value twice would ensure
the arguments are consistent with each other.
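In C11 atomics terms (a userspace sketch, not the LIO code itself),
the fix is to take one snapshot and use it for both values:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Read max_cmd_sn once so the printed window width and the printed
 * max_cmd_sn value are consistent with each other, even if another
 * thread increments the counter between the two uses. */
static int format_window(atomic_uint *max_cmd_sn, unsigned int exp_cmd_sn,
                         char *buf, size_t len)
{
        unsigned int max_sn = atomic_load(max_cmd_sn);  /* single snapshot */

        return snprintf(buf, len, "0x%08x 0x%08x 0x%08x",
                        (max_sn - exp_cmd_sn) + 1, exp_cmd_sn, max_sn);
}
```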

> diff --git a/drivers/target/iscsi/iscsi_target_device.c
> b/drivers/target/iscsi/iscsi_target_device.c
> index 5fabcd3..a526904 100644
> --- a/drivers/target/iscsi/iscsi_target_device.c
> +++ b/drivers/target/iscsi/iscsi_target_device.c
...
> @@ -57,9 +57,7 @@ void iscsit_increment_maxcmdsn(struct iscsi_cmd *cmd,
> struct iscsi_session *sess
> 
>   cmd->maxcmdsn_inc = 1;
> 
> - mutex_lock(&sess->cmdsn_mutex);
> - sess->max_cmd_sn += 1;
> - pr_debug("Updated MaxCmdSN to 0x%08x\n", sess->max_cmd_sn);
> - mutex_unlock(&sess->cmdsn_mutex);
> + atomic_inc(&sess->max_cmd_sn);
> + pr_debug("Updated MaxCmdSN to 0x%08x\n", atomic_read(&sess-
> >max_cmd_sn));
>  }
>  EXPORT_SYMBOL(iscsit_increment_maxcmdsn);

If there is another change before the atomic_read, this would
print a value unrelated to the increment done by this thread. 
atomic_inc_return would provide the appropriate value to print.
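With C11 atomics (again a userspace sketch), the equivalent of
atomic_inc_return is atomic_fetch_add plus one, which yields exactly
the value this thread's increment produced:

```c
#include <assert.h>
#include <stdatomic.h>

/* atomic_fetch_add returns the value *before* the increment, so
 * prior + 1 is the value this thread produced -- unaffected by
 * concurrent increments that land between the add and any later
 * separate atomic_load. */
static unsigned int increment_maxcmdsn(atomic_uint *max_cmd_sn)
{
        return atomic_fetch_add(max_cmd_sn, 1) + 1;
}
```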

---
Robert Elliott, HP Server Storage


RE: [PATCH V4] x86/mm/pat: Do a small optimization and fix in reserve_memtype

2015-07-22 Thread Elliott, Robert (Server Storage)


---
Robert Elliott, HP Server Storage

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Pan Xinhui
> Sent: Wednesday, July 22, 2015 8:27 AM
...
> diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
> index 188e3e0..8fa1f07 100644
> --- a/arch/x86/mm/pat.c
> +++ b/arch/x86/mm/pat.c
> @@ -521,10 +521,7 @@ int reserve_memtype(u64 start, u64 end, enum
> page_cache_mode req_type,
> 
>   is_range_ram = pat_pagerange_is_ram(start, end);
>   if (is_range_ram == 1) {
> -
> - err = reserve_ram_pages_type(start, end, req_type, new_type);
> -
> - return err;
> + return reserve_ram_pages_type(start, end, req_type, new_type);
>   } else if (is_range_ram < 0) {
>   return -EINVAL;
>   }

With each branch now just one line, the {} can be removed.




RE: [PATCH] cciss: update copyright

2015-06-25 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Don Brace [mailto:don.br...@pmcs.com]
> Sent: Thursday, June 25, 2015 7:48 PM
> To: gerry.mor...@pmcs.com; ax...@kernel.uk; storage...@pmcs.com; ISS
> StorageDev; scott.ben...@pmcs.com
> Cc: linux-kernel@vger.kernel.org
> Subject: [PATCH] cciss: update copyright
> 
> - add in PMC-Sierra
> - change e-mail address
> 
> Signed-off-by: Don Brace 
> ---
>  drivers/block/cciss.c  |8 ++--
>  drivers/block/cciss.h  |   18 ++
>  drivers/block/cciss_cmd.h  |   18 ++
>  drivers/block/cciss_scsi.c |8 ++--
>  drivers/block/cciss_scsi.h |8 ++--
>  5 files changed, 42 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 0422c47..04b8dfb 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -1,5 +1,6 @@
>  /*
>   *Disk Array driver for HP Smart Array controllers.
> + *(C) Copyright 2014-2015 PMC-Sierra, Inc.
>   *(C) Copyright 2000, 2007 Hewlett-Packard Development Company, L.P.
>   *
>   *This program is free software; you can redistribute it and/or
> modify
> @@ -11,12 +12,7 @@
>   *MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>   *General Public License for more details.
>   *
> - *You should have received a copy of the GNU General Public License
> - *along with this program; if not, write to the Free Software
> - *Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
> - *02111-1307, USA.
> - *
> - *Questions/Comments/Bugfixes to iss_storage...@hp.com
> + *Questions/Comments/Bugfixes to storage...@hp.com

wrong company

>   *
>   */
> 
> diff --git a/drivers/block/cciss.h b/drivers/block/cciss.h
> index 7fda30e..2c5be7a 100644
> --- a/drivers/block/cciss.h
> +++ b/drivers/block/cciss.h
> @@ -1,3 +1,21 @@
> +/*
> + *Disk Array driver for HP Smart Array controllers.
> + *(C) Copyright 2014-2015 PMC-Sierra, Inc.
> + *(C) Copyright 2000, 2007 Hewlett-Packard Development Company, L.P.
> + *
> + *This program is free software; you can redistribute it and/or
> modify
> + *it under the terms of the GNU General Public License as published
> by
> + *the Free Software Foundation; version 2 of the License.
> + *
> + *This program is distributed in the hope that it will be useful,
> + *but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + *General Public License for more details.
> + *
> + *Questions/Comments/Bugfixes to storage...@hp.com
> + *
> + */
> +
>  #ifndef CCISS_H
>  #define CCISS_H
> 
> diff --git a/drivers/block/cciss_cmd.h b/drivers/block/cciss_cmd.h
> index d9be6b4..808cb92 100644
> --- a/drivers/block/cciss_cmd.h
> +++ b/drivers/block/cciss_cmd.h
> @@ -1,3 +1,21 @@
> +/*
> + *Disk Array driver for HP Smart Array controllers.
> + *(C) Copyright 2014-2015 PMC-Sierra, Inc.
> + *(C) Copyright 2000, 2007 Hewlett-Packard Development Company, L.P.
> + *
> + *This program is free software; you can redistribute it and/or
> modify
> + *it under the terms of the GNU General Public License as published
> by
> + *the Free Software Foundation; version 2 of the License.
> + *
> + *This program is distributed in the hope that it will be useful,
> + *but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + *General Public License for more details.
> + *
> + *Questions/Comments/Bugfixes to storage...@hp.com

ditto


> + *
> + */
> +
>  #ifndef CCISS_CMD_H
>  #define CCISS_CMD_H
> 
> diff --git a/drivers/block/cciss_scsi.c b/drivers/block/cciss_scsi.c
> index 1537302..13e46c4 100644
> --- a/drivers/block/cciss_scsi.c
> +++ b/drivers/block/cciss_scsi.c
> @@ -1,5 +1,6 @@
>  /*
>   *Disk Array driver for HP Smart Array controllers, SCSI Tape module.
> + *(C) Copyright 2014-2015 PMC-Sierra, Inc.
>   *(C) Copyright 2001, 2007 Hewlett-Packard Development Company, L.P.
>   *
>   *This program is free software; you can redistribute it and/or
> modify
> @@ -11,12 +12,7 @@
>   *MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>   *General Public License for more details.
>   *
> - *You should have received a copy of the GNU General Public License
> - *along with this program; if not, write to the Free Software
> - *Foundation, Inc., 59 Temple Place, Suite 300, Boston, MA
> - *02111-1307, USA.
> - *
> - *Questions/Comments/Bugfixes to iss_storage...@hp.com
> + *Questions/Comments/Bugfixes to storage...@hp.com

ditto

>   *
>   *Author: Stephen M. Cameron
>   */
> diff --git a/drivers/block/cciss_scsi.h b/drivers/block/cciss_scsi.h
> index e71d986..e6eb915 100644
> --- a/drivers/block/cciss_scsi.h
> +++ b/drivers/block/cciss_scsi.h
> @@ -1,5 +1,6 @@
>  /*
>   *Disk Array driver for HP Smart A

RE: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

2015-06-10 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Wednesday, June 10, 2015 9:58 AM
> To: Jeff Moyer
> Cc: linux-nvdimm; Rafael J. Wysocki; linux-kernel@vger.kernel.org; Linux
> ACPI
> Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices
> 
> On Wed, Jun 10, 2015 at 8:54 AM, Jeff Moyer  wrote:
> > Toshi Kani  writes:
> >
> >> Since NVDIMMs are installed on memory slots, they expose the NUMA
> >> topology of a platform.  This patchset adds support of sysfs
> >> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
> >> This enables numactl(8) to accept 'block:' and 'file:' paths of
> >> pmem and btt devices as shown in the examples below.
> >>   numactl --preferred block:pmem0 --show
> >>   numactl --preferred file:/dev/pmem0s --show
> >>
> >> numactl can be used to bind an application to the locality of
> >> a target NVDIMM for better performance.  Here is a result of fio
> >> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
> >> remote settings.
> >>
> >>   Local [1] :  4098.3MB/s
> >>   Remote [2]:  3718.4MB/s
> >>
> >> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio  on-pmem0>
> >> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio  on-pmem0>
> >
> > Did you post the patches to numactl somewhere?
> >
> 
> numactl already supports this today.

numactl does have a bug handling partitions under these devices,
because it assumes all storage devices have "/devices/pci"
in their path as it tries to find the parent device for the
partition.  I think we'll propose a numactl patch for that;
I don't think the drivers can fool it.

Details (from an earlier version of the patch series
in which btt devices were named /dev/nd1, etc.):

strace shows that numactl is trying to find numa_node in very
different locations for /dev/nd1p1 vs. /dev/sda1.

strace for /dev/nd1p1
=
open("/sys/class/block/nd1p1/dev", O_RDONLY) = 4
read(4, "259:1\n", 4095)= 6
close(4)= 0
close(3)= 0
readlink("/sys/class/block/nd1p1", "../../devices/LNXSYSTM:00/LNXSYB"..., 1024) 
= 77
open("/sys/class/block/nd1p1/device/numa_node", O_RDONLY) = -1 ENOENT (No such 
file or directory)

strace for /dev/sda1

open("/sys/class/block/sda1/dev", O_RDONLY) = 4
read(4, "8:1\n", 4095)  = 4
close(4)= 0
close(3)= 0
readlink("/sys/class/block/sda1", "../../devices/pci0000:00/0000:00"..., 1024) 
= 91
open("/sys//devices/pci0000:00/0000:00:01.0//numa_node", O_RDONLY) = 3
read(3, "0\n", 4095)= 2
close(3)= 0

The "sys/class/block/xxx" paths link to:
lrwxrwxrwx. 1 root root 0 May 20 20:42 /sys/class/block/nd1p1 -> 
../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/btt1/block/nd1/nd1p1
lrwxrwxrwx. 1 root root 0 May 20 20:41 /sys/class/block/sda1 -> 
../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1


For /dev/sda1, numactl recognizes "/devices/pci" as
a special path, and strips off everything after the
numbers.  Faced with:
../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1

it ends up with this (leaving a sloppy "//" in the path):
/sys/devices/pci0000:00/0000:00:01.0//numa_node

It would also succeed if it ended up with this:
/sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/numa_node

For /dev/nd1p1 it does not see that string, so just
tries to open "/sys/class/block/nd1p1/device/numa_node"

There are no "device/" subdirectories in the tree for
partition devices (for either sda1 or nd1p1), so this 
fails.


From http://oss.sgi.com/projects/libnuma/
numactl affinity.c:
    /* Somewhat hackish: extract device from symlink path.
       Better would be a direct backlink. This knows slightly too
       much about the actual sysfs layout. */
    char path[1024];
    char *fn = NULL;
    if (asprintf(&fn, "/sys/class/%s/%s", cls, dev) > 0 &&
        readlink(fn, path, sizeof path) > 0) {
        regex_t re;
        regmatch_t match[2];
        char *p;

        regcomp(&re, "(/devices/pci[0-9a-fA-F:/]+\\.[0-9]+)/",
                REG_EXTENDED);
        ret = regexec(&re, path, 2, match, 0);
        regfree(&re);
        if (ret == 0) {
            free(fn);
            assert(match[0].rm_so > 0);
            assert(match[0].rm_eo > 0);
            path[match[1].rm_eo + 1] = 0;
            p = path + match[0].rm_so;
            ret = sysfs_node_read(mask, "/sys/%s/numa_node", p);
            if (ret < 0)
                return node_parse_failure(ret, NULL, p);
            return ret;
 

RE: [PATCH 1/2] scatterlist: use sg_phys()

2015-06-09 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Dan Williams
> Sent: Tuesday, June 09, 2015 10:27 AM
> Subject: [PATCH 1/2] scatterlist: use sg_phys()
> 
...
> diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
> index ed7ba8a11822..dcb3c594d626 100644
> --- a/arch/microblaze/kernel/dma.c
> +++ b/arch/microblaze/kernel/dma.c
> @@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct
> scatterlist *sgl,
>   /* FIXME this part of code is untested */
>   for_each_sg(sgl, sg, nents, i) {
>   sg->dma_address = sg_phys(sg);
> - __dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
> + __dma_sync(sg_phys(sg),
>   sg->length, direction);
>   }

That one ends up with weird indentation.




RE: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()

2015-06-08 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf
> Of Dan Williams
> Sent: Friday, June 05, 2015 3:19 PM
> Subject: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to
> __pfn_t + kmap_atomic_pfn_t()
...
> diff --git a/arch/powerpc/sysdev/axonram.c
> b/arch/powerpc/sysdev/axonram.c
> index e8657d3bc588..20725006592e 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
...
> @@ -165,9 +166,13 @@ static int axon_ram_probe(struct platform_device
> *device)
>  {
>   static int axon_ram_bank_id = -1;
>   struct axon_ram_bank *bank;
> - struct resource resource;
> + struct resource *resource;
>   int rc = 0;
> 
> + resource = devm_kzalloc(&device->dev, sizeof(*resource),
> GFP_KERNEL);
> + if (!resource)
> + return -ENOMEM;
> +

Since resource is now a pointer...

...
> @@ -184,13 +189,13 @@ static int axon_ram_probe(struct platform_device
> *device)
> 
>   bank->device = device;
> 
> - if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0)
> {
> + if (of_address_to_resource(device->dev.of_node, 0, resource) != 0) {
>   dev_err(&device->dev, "Cannot access device tree\n");
>   rc = -EFAULT;
>   goto failed;
>   }
...

... I'd expect to see a devm_kfree call added after the failed label,
as was done here:

> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 2f1734ba0e22..a7b9743c546f 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
...
>  struct dcssblk_dev_info {
> @@ -520,12 +522,18 @@ static const struct attribute_group
> *dcssblk_dev_attr_groups[] = {
>  static ssize_t
>  dcssblk_add_store(struct device *dev, struct device_attribute *attr, const
> char *buf, size_t count)
>  {
> + struct resource *res = devm_kzalloc(dev, sizeof(*res), GFP_KERNEL);
>   int rc, i, j, num_of_segments;
>   struct dcssblk_dev_info *dev_info;
>   struct segment_info *seg_info, *temp;
>   char *local_buf;
>   unsigned long seg_byte_size;
> 
> + if (!res) {
> + rc = -ENOMEM;
> + goto out_nobuf;
> + }
> +
>   dev_info = NULL;
>   seg_info = NULL;
>   if (dev != dcssblk_root_dev) {
> @@ -652,6 +660,13 @@ dcssblk_add_store(struct device *dev, struct
> device_attribute *attr, const char
>   if (rc)
>   goto put_dev;
> 
> + res->start = dev_info->start;
> + res->end = dev_info->end - 1;
> + rc = devm_register_kmap_pfn_range(&dev_info->dev, res,
> + (void *) dev_info->start);
> + if (rc)
> + goto put_dev;
> +
>   get_device(&dev_info->dev);
>   add_disk(dev_info->gd);
> 
> @@ -699,6 +714,8 @@ seg_list_del:
>  out:
>   kfree(local_buf);
>  out_nobuf:
> + if (res)
> + devm_kfree(dev, res);
>   return rc;
>  }




RE: [PATCH v10 12/12] drivers/block/pmem: Map NVDIMM with ioremap_wt()

2015-05-29 Thread Elliott, Robert (Server Storage)


---
Robert Elliott, HP Server Storage

> -Original Message-
> From: Andy Lutomirski [mailto:l...@amacapital.net]
> Sent: Friday, May 29, 2015 4:46 PM
> To: Elliott, Robert (Server Storage)
> Cc: Dan Williams; Kani, Toshimitsu; Borislav Petkov; Ross Zwisler;
> H. Peter Anvin; Thomas Gleixner; Ingo Molnar; Andrew Morton; Arnd
> Bergmann; linux...@kvack.org; linux-kernel@vger.kernel.org; X86 ML;
> linux-nvd...@lists.01.org; Juergen Gross; Stefan Bader; Henrique de
> Moraes Holschuh; Yigal Korman; Konrad Rzeszutek Wilk; Luis
> Rodriguez; Christoph Hellwig; Matthew Wilcox
> Subject: Re: [PATCH v10 12/12] drivers/block/pmem: Map NVDIMM with
> ioremap_wt()
> 
> On Fri, May 29, 2015 at 2:29 PM, Elliott, Robert (Server Storage)
>  wrote:
> >> -Original Message-
> >> From: Andy Lutomirski [mailto:l...@amacapital.net]
> >> Sent: Friday, May 29, 2015 1:35 PM
> > ...
> >> Whoa, there!  Why would we use non-temporal stores to WB memory
> to
> >> access persistent memory?  I can see two reasons not to:
> >
> > Data written to a block storage device (here, the NVDIMM) is
> unlikely
> > to be read or written again any time soon.  It's not like the code
> > and data that a program has in memory, where there might be a loop
> > accessing the location every CPU clock; it's storage I/O to
> > historically very slow (relative to the CPU clock speed) devices.
> > The source buffer for that data might be frequently accessed,
> > but not the NVDIMM storage itself.
> >
> > Non-temporal stores avoid wasting cache space on these "one-time"
> > accesses.  The same applies for reads and non-temporal loads.
> > Keep the CPU data cache lines free for the application.
> >
> > DAX and mmap() do change that; the application is now free to
> > store frequently accessed data structures directly in persistent
> > memory.  But, that's not available if btt is used, and
> > application loads and stores won't go through the memcpy()
> > calls inside pmem anyway.  The non-temporal instructions are
> > cache coherent, so data integrity won't get confused by them
> > if I/O going through pmem's block storage APIs happens
> > to overlap with the application's mmap() regions.
> >
> 
> You answered the wrong question. :)  I understand the point of the
> non-temporal stores -- I don't understand the point of using
> non-temporal stores to *WB memory*.  I think we should be okay with
> having the kernel mapping use WT instead.

The cache type that the application chooses for its mmap()
view has to be compatible with that already selected by the 
kernel, or we run into:

Intel SDM 11.12.4 Programming the PAT
...
"The PAT allows any memory type to be specified in the page tables,
and therefore it is possible to have a single physical page mapped
to two or more different linear addresses, each with different
memory types. Intel does not support this practice because it may
lead to undefined operations that can result in a system failure. 
In particular, a WC page must never be aliased to a cacheable page
because WC writes may not check the processor caches."

Right now, application memory is always WB, so WB is the
only safe choice from this perspective (the system must have
ADR for safety from other perspectives). That might not be 
the best choice for all applications, though; some applications
might not want CPU caching all the data they run through here 
and prefer WC.  On a non-ADR system, WT might be the only 
safe choice.

Should there be a way for the application to specify a cache
type in its mmap() call? The type already selected by the
kernel driver could (carefully) be changed on the fly if 
it's different.

Non-temporal store performance is excellent under WB, WC, and WT;
if anything, I think WC edges ahead because it need not snoop
the cache. It's still poor under UC.





RE: [PATCH v10 12/12] drivers/block/pmem: Map NVDIMM with ioremap_wt()

2015-05-29 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Andy Lutomirski [mailto:l...@amacapital.net]
> Sent: Friday, May 29, 2015 1:35 PM
...
> Whoa, there!  Why would we use non-temporal stores to WB memory to
> access persistent memory?  I can see two reasons not to:

Data written to a block storage device (here, the NVDIMM) is unlikely
to be read or written again any time soon.  It's not like the code
and data that a program has in memory, where there might be a loop
accessing the location every CPU clock; it's storage I/O to
historically very slow (relative to the CPU clock speed) devices.  
The source buffer for that data might be frequently accessed, 
but not the NVDIMM storage itself.  

Non-temporal stores avoid wasting cache space on these "one-time" 
accesses.  The same applies for reads and non-temporal loads.
Keep the CPU data cache lines free for the application.

DAX and mmap() do change that; the application is now free to
store frequently accessed data structures directly in persistent 
memory.  But, that's not available if btt is used, and 
application loads and stores won't go through the memcpy()
calls inside pmem anyway.  The non-temporal instructions are
cache coherent, so data integrity won't get confused by them
if I/O going through pmem's block storage APIs happens
to overlap with the application's mmap() regions.
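For reference, a minimal non-temporal copy in userspace looks like
this (SSE2 rather than AVX for brevity; x86-only, and purely
illustrative of the instruction class being discussed):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si32, _mm_sfence */

/* Copy 32-bit words with non-temporal stores, which bypass the CPU
 * data cache while remaining cache coherent.  The sfence makes the
 * streaming stores globally visible before the function returns. */
static void nt_copy32(int *dst, const int *src, int n)
{
        int i;

        for (i = 0; i < n; i++)
                _mm_stream_si32(dst + i, src[i]);
        _mm_sfence();
}
```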

---
Robert Elliott, HP Server Storage




RE: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface development

2015-05-25 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf
> Of Dan Williams
> Sent: Wednesday, May 20, 2015 3:58 PM
> To: ax...@kernel.dk
> Subject: [PATCH v3 20/21] nfit-test: manufactured NFITs for interface
> development
...

Attached is some experimental code to try pmem with different 
cache types (UC, WB, WC, and WT) and memcpy functions using x86 
AVX non-temporal load and store instructions.

It depends on Toshi's WT patch series:
https://lkml.org/lkml/2015/5/13/866

If you don't have that, you can just comment out the lines related
to ioremap_wt.

---
Rob Elliott, HP Server Storage



Attachment: 0001-pmem-cache-type


RE: [PATCH v3 18/21] nd_btt: atomic sector updates

2015-05-22 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Wednesday, May 20, 2015 3:58 PM
> To: ax...@kernel.dk
> Cc: mi...@kernel.org; linux-nvd...@lists.01.org; ne...@suse.de;
> gre...@linuxfoundation.org; Dave Chinner; linux-kernel@vger.kernel.org; Andy
> Lutomirski; Jens Axboe; linux-a...@vger.kernel.org; H. Peter Anvin;
> h...@lst.de
> Subject: [PATCH v3 18/21] nd_btt: atomic sector updates
> 
> From: Vishal Verma 
> 
...
> diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
> index 00d9afe9475e..2b169806eac5 100644
> --- a/drivers/block/nd/Kconfig
> +++ b/drivers/block/nd/Kconfig
> @@ -32,9 +32,25 @@ config BLK_DEV_PMEM
> capable of DAX (direct-access) file system mappings.  See
> Documentation/blockdev/nd.txt for more details.
> 
> -   Say Y if you want to use a NVDIMM described by NFIT
> +   Say Y if you want to use a NVDIMM described by ACPI, E820, etc...
> 
>  config ND_BTT_DEVS
> - def_bool y
> + bool
> +
> +config ND_BTT
> + tristate "BTT: Block Translation Table (atomic sector updates)"
> + depends on LIBND
> + default LIBND
> + select ND_BTT_DEVS

The ND_BTT option, which is presented during a kernel build,
is missing help text. So is E820_PMEM in patch 3/21.

---
Robert Elliott, HP Server Storage
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH v3 14/21] libnd: blk labels and namespace instantiation

2015-05-22 Thread Elliott, Robert (Server Storage)

> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Wednesday, May 20, 2015 3:57 PM
> To: ax...@kernel.dk
> Cc: linux-nvd...@lists.01.org; ne...@suse.de; gre...@linuxfoundation.org;
> linux-kernel@vger.kernel.org; h...@lst.de; linux-a...@vger.kernel.org;
> mi...@kernel.org
> Subject: [PATCH v3 14/21] libnd: blk labels and namespace instantiation
> 
...
> @@ -1029,6 +1244,173 @@ static struct device **create_namespace_pmem(struct
> nd_region *nd_region)
>   return NULL;
>  }
> 
> +struct resource *nsblk_add_resource(struct nd_region *nd_region,
> + struct nd_dimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
> + resource_size_t start)
> +{
> + struct nd_label_id label_id;
> + struct resource *res;
> +
> + nd_label_gen_id(&label_id, nsblk->uuid, NSLABEL_FLAG_LOCAL);
> + nsblk->res = krealloc(nsblk->res,
> + sizeof(void *) * (nsblk->num_resources + 1),
> + GFP_KERNEL);
> + if (!nsblk->res)
> + return NULL;

scripts/checkpatch.pl doesn't like that:
WARNING: Reusing the krealloc arg is almost always a bug
#1411: FILE: drivers/block/nd/namespace_devs.c:1411:
+   nsblk->res = krealloc(nsblk->res,

The reasoning (https://lkml.org/lkml/2013/3/14/558) is:

"If krealloc() returns NULL, it *doesn't* free the original. So any 
code of the form 'foo = krealloc(foo, …);' is almost certainly a bug."
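The safe idiom keeps the result in a temporary until the call is
known to have succeeded.  A userspace sketch with realloc(), which
has the same failure semantics as krealloc() (names here are
illustrative):

```c
#include <stdlib.h>

/* Safe grow pattern: keep the old pointer until realloc() succeeds.
 * realloc(), like krealloc(), does NOT free the original block on
 * failure, so 'p = realloc(p, n)' leaks the old block when it fails. */
static int grow(void **bufp, size_t new_size)
{
	void *tmp = realloc(*bufp, new_size);

	if (!tmp)
		return -1;   /* *bufp is still valid; caller keeps ownership */
	*bufp = tmp;
	return 0;
}
```

On failure the caller can still free (or keep using) the original
buffer, which is exactly what the krealloc-reuse pattern loses.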


---
Robert Elliott, HP Server Storage


RE: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates

2015-05-20 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: dan.j.willi...@gmail.com [mailto:dan.j.willi...@gmail.com] On
> Behalf Of Dan Williams
> Sent: Saturday, May 16, 2015 10:22 PM
> To: Elliott, Robert (Server Storage)
> Cc: linux-nvd...@lists.01.org; Neil Brown; Greg KH; Dave Chinner; linux-
> ker...@vger.kernel.org; Andy Lutomirski; Jens Axboe; H. Peter Anvin;
> Christoph Hellwig; Ingo Molnar
> Subject: Re: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector
> updates
> 
> On Sat, May 16, 2015 at 6:19 PM, Elliott, Robert (Server Storage)
>  wrote:
> >
...
> 2 items to check:
> 
> 1/ make sure you have your btt sector size set to 4k, which cuts down
> the overhead by a factor of 8.
> 
> 2/ boot with nr_cpus=256 or lower.
> 
> Ross noticed that CONFIG_NR_CPUS is set quite high on distro kernels
> which revealed that we should have been using nr_cpu_ids and percpu
> variables for nd_region_acquire_lane() from the outset.  This fix is
> coming in v3.

My system does have CONFIG_NR_CPUS=8192.  I'll try the items you 
suggested.




RE: [Linux-nvdimm] [PATCH v2 07/20] libnd, nd_dimm: dimm driver and base libnd device-driver infrastructure

2015-05-20 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf
> Of Dan Williams
> Sent: Tuesday, April 28, 2015 1:25 PM
> To: linux-nvd...@lists.01.org
> Cc: Neil Brown; Greg KH; linux-kernel@vger.kernel.org
> Subject: [Linux-nvdimm] [PATCH v2 07/20] libnd, nd_dimm: dimm driver and
> base libnd device-driver infrastructure
> 
...
> diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
> new file mode 100644
> index ..a4c8e3ffe97c
> --- /dev/null
> +++ b/drivers/block/nd/dimm.c
> @@ -0,0 +1,93 @@
> +/*
> + * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "nd.h"
> +
> +static void free_data(struct nd_dimm_drvdata *ndd)
> +{
> + if (!ndd)
> + return;
> +
> + if (ndd->data && is_vmalloc_addr(ndd->data))
> + vfree(ndd->data);
> + else
> + kfree(ndd->data);
> + kfree(ndd);
> +}
> +
> +static int nd_dimm_probe(struct device *dev)
> +{
> + struct nd_dimm_drvdata *ndd;
> + int rc;
> +
> + ndd = kzalloc(sizeof(*ndd), GFP_KERNEL);
> + if (!ndd)
> + return -ENOMEM;
> +
> + dev_set_drvdata(dev, ndd);
> + ndd->dev = dev;
> +
> + rc = nd_dimm_init_nsarea(ndd);
> + if (rc)
> + goto err;
> +
> + rc = nd_dimm_init_config_data(ndd);
> + if (rc)
> + goto err;
> +
> + dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
> +
> + return 0;
> +
> + err:
> + free_data(ndd);
> + return rc;
> +
> +}
> +
> +static int nd_dimm_remove(struct device *dev)
> +{
> + struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
> +
> + free_data(ndd);
> +
> + return 0;
> +}

It would reduce the slight window for stale pointer usage if you add:
dev_set_drvdata(dev, NULL);

before
free_data(ndd);

in both nd_dimm_remove and the err exit for nd_dimm_probe.

The same comment applies to all the drivers that store pointers
with dev_set_drvdata - btt, pmem, etc.
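A userspace mock-up of the suggested ordering (struct device and
the two helpers are simplified stand-ins for the driver core, not
the kernel's definitions):

```c
#include <stdlib.h>

/* Minimal stand-ins for the driver-core drvdata accessors. */
struct device { void *drvdata; };
static void dev_set_drvdata(struct device *dev, void *data) { dev->drvdata = data; }
static void *dev_get_drvdata(struct device *dev) { return dev->drvdata; }

struct nd_dimm_drvdata { int dummy; };

static void free_data(struct nd_dimm_drvdata *ndd) { free(ndd); }

/* Remove path with the suggested ordering: clear the drvdata
 * pointer first, then free, so nothing can fetch a pointer to
 * memory that is about to be (or already is) freed. */
static int nd_dimm_remove(struct device *dev)
{
	struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);

	dev_set_drvdata(dev, NULL);   /* close the stale-pointer window */
	free_data(ndd);
	return 0;
}
```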




RE: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates

2015-05-16 Thread Elliott, Robert (Server Storage)

> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Tuesday, April 28, 2015 1:26 PM
> To: linux-nvd...@lists.01.org
> Cc: Ingo Molnar; Neil Brown; Greg KH; Dave Chinner; linux-
> ker...@vger.kernel.org; Andy Lutomirski; Jens Axboe; H. Peter Anvin;
> Christoph Hellwig
> Subject: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates
> 
> From: Vishal Verma 
> 
> BTT stands for Block Translation Table, and is a way to provide power
> fail sector atomicity semantics for block devices that have the ability
> to perform byte granularity IO. It relies on the ->rw_bytes() capability
> of provided nd namespace devices.
> 
> The BTT works as a stacked blocked device, and reserves a chunk of space
> from the backing device for its accounting metadata.  BLK namespaces may
> mandate use of a BTT and expect the bus to initialize a BTT if not
> already present.  Otherwise if a BTT is desired for other namespaces (or
> partitions of a namespace) a BTT may be manually configured.
...

Running btt above pmem with a variety of workloads, I see an awful lot 
of time spent in two places:
* _raw_spin_lock 
* btt_make_request

This occurs for fio to raw /dev/ndN devices, ddpt over ext4 or xfs,
cp -R of large directories, and running make on the linux kernel.

Some specific results:

fio 4 KiB random reads, WC cache type, memcpy:
* 43175 MB/s,   8 M IOPS  pmem0 and pmem1
* 18500 MB/s, 1.5 M IOPS  nd0 and nd1

fio 4 KiB random reads, WC cache type, memcpy with non-temporal
loads (when everything is 64-byte aligned):
* 33814 MB/s, 4.3 M IOPS  nd0 and nd1

Zeroing out 32 GiB with ddpt:
* 19 s, 1800 MiB/s  pmem
* 55 s,  625 MiB/s  btt

If btt_make_request needs to stall this much, maybe it'd be better
to utilize the blk-mq request queues, keeping requests in per-CPU
queues while they're waiting, and using IPIs for completion 
interrupts when they're finally done.


fio 4 KiB random reads without non-temporal memcpy
==
perf top shows memcpy_erms taking all the time, a function that
uses REP; MOVSB instructions:
 85.78%  [kernel] [k] memcpy_erms
  1.21%  [kernel] [k] _raw_spin_lock
  0.72%  [nd_btt] [k] btt_make_request
  0.67%  [kernel] [k] do_blockdev_direct_IO
  0.47%  fio  [.] get_io_u

fio 4 KiB random reads with non-temporal memcpy
===
perf top shows there are still quite a few unaligned accesses
resulting in legacy memcpy, but about equal time is now spent
in legacy vs NT memcpy:
 30.47%  [kernel][k] memcpy_erms
 26.27%  [kernel][k] memcpy_lnt_st_64
  5.37%  [kernel][k] _raw_spin_lock
  2.20%  [kernel][k] btt_make_request
  2.03%  [kernel][k] do_blockdev_direct_IO
  1.41%  fio [.] get_io_u
  1.22%  [kernel][k] btt_map_read
  1.15%  [kernel][k] pmem_rw_bytes
  1.01%  [kernel][k] nd_btt_rw_bytes
  0.98%  [kernel][k] nd_region_acquire_lane
  0.89%  fio [.] get_next_rand_block
  0.88%  fio [.] thread_main
  0.79%  fio [.] ios_completed
  0.76%  fio [.] td_io_queue
  0.75%  [kernel][k] _raw_spin_lock_irqsave
  0.68%  [kernel][k] kmem_cache_free
  0.66%  [kernel][k] kmem_cache_alloc
  0.59%  [kernel][k] __audit_syscall_exit
  0.57%  [kernel][k] aio_complete
  0.54%  [kernel][k] do_io_submit
  0.52%  [kernel][k] _raw_spin_unlock_irqrestore

fio randrw workload
===
perf top shows that adding writes to the mix brings btt_make_request
and its cpu_relax() loop to the forefront:
  21.09%  [nd_btt]  [k] btt_make_request 
  19.06%  [kernel]  [k] memcpy_erms  
  14.35%  [kernel]  [k] _raw_spin_lock   
  10.38%  [nd_pmem] [k] memcpy_lnt_st_64
   1.57%  [kernel]  [k] do_blockdev_direct_IO   
   1.51%  [nd_pmem] [k] memcpy_lt_snt_64  
   1.43%  [nd_btt]  [k] nd_btt_rw_bytes   
   1.39%  [kernel]  [k] radix_tree_next_chunk  
   1.33%  [kernel]  [k] put_page 
   1.21%  [nd_pmem] [k] pmem_rw_bytes  
   1.11%  fio   [.] get_io_u  
   0.90%  fio   [.] io_u_queued_complete  
   0.74%  [kernel]  [k] system_call 
   0.72%  [libnd]   [k] nd_region_acquire_lane   
   0.71%  [nd_btt]  [k] btt_map_read
   0.62%  fio   [.]

RE: [Linux-nvdimm] [PATCH v2 18/20] libnd: infrastructure for btt devices

2015-05-14 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf
> Of Dan Williams
> Sent: Thursday, May 14, 2015 7:42 PM
> To: Kani, Toshimitsu
> Cc: Neil Brown; Greg KH; linux-kernel@vger.kernel.org; linux-
> nvd...@lists.01.org
> Subject: Re: [Linux-nvdimm] [PATCH v2 18/20] libnd: infrastructure for
> btt devices
> 
...
> So we can fix this to be at least as stable as the backing device
> names [1], but as far as I can see we would need to start using the
> backing device name in the btt device name.  A strawman proposal is to
> append 's' to indicated 'sectored'.  So /dev/pmem0s is the btt
> instance fronting /dev/pmem0.  Other examples:
> 
> /dev/pmem0p1s
> /dev/ndblk0.0s
> /dev/ndblk0.0p1s
> ...
> 
> Thoughts?
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-April/000636.html

I like that; it also hints to the user that another driver has already
claimed /dev/pmem0, similar to how the presence of /dev/sda1, /dev/sda2,
etc. hints that a program has partitioned /dev/sda.




RE: [PATCH v5 1/6] x86/mm/pat: use pr_info() and friends

2015-05-06 Thread Elliott, Robert (Server Storage)
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Luis R. Rodriguez
> Sent: Thursday, April 30, 2015 3:25 PM
> Subject: [PATCH v5 1/6] x86/mm/pat: use pr_info() and friends
> 
...
> - printk(KERN_ERR "%s:%d map pfn expected mapping
> type %s"
> - " for [mem %#010Lx-%#010Lx], got %s\n",
> - current->comm, current->pid,
> - cattr_name(want_pcm),
> - (unsigned long long)paddr,
> - (unsigned long long)(paddr + size - 1),
> - cattr_name(pcm));
> + pr_err("%s:%d map pfn expected mapping type %s"
> +" for [mem %#010Lx-%#010Lx], got %s\n",

Since the patch joins some other print format strings split across 
lines (which checkpatch allows), you might want to join this one too.

...
> diff --git a/arch/x86/mm/pat_rbtree.c b/arch/x86/mm/pat_rbtree.c
...
>  failure:
> - printk(KERN_INFO "%s:%d conflicting memory types "
> + pr_info("%s:%d conflicting memory types "
>   "%Lx-%Lx %s<->%s\n", current->comm, current->pid, start,
>   end, cattr_name(found_type), cattr_name(match->type));

and that one.




RE: [PATCH 0/13] Parallel struct page initialisation v4

2015-05-02 Thread Elliott, Robert (Server Storage)

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Daniel J Blueman
> Sent: Thursday, April 30, 2015 11:10 AM
> Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4
...
> On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're
> seeing stock 4.0 boot in 7136s. This drops to 2159s, or a 70% reduction
> with this patchset. Non-temporal PMD init [1] drops this to 1045s.
> 
> Nathan, what do you guys see with the non-temporal PMD patch [1]? Do
> add an sfence at the end label if you manually patch.
> 
...
> [1] https://lkml.org/lkml/2015/4/23/350

From that post:
> +loop_64:
> + decq  %rcx
> + movnti  %rax,(%rdi)
> + movnti  %rax,8(%rdi)
> + movnti  %rax,16(%rdi)
> + movnti  %rax,24(%rdi)
> + movnti  %rax,32(%rdi)
> + movnti  %rax,40(%rdi)
> + movnti  %rax,48(%rdi)
> + movnti  %rax,56(%rdi)
> + leaq  64(%rdi),%rdi
> + jnz  loop_64

There are some even more efficient instructions available in x86,
depending on the CPU features:
* movnti8 byte
* movntdq %xmm  16 byte, SSE
* vmovntdq %ymm 32 byte, AVX
* vmovntdq %zmm 64 byte, AVX-512 (forthcoming)

The last will transfer a full cache line at a time.

For NVDIMMs, the nd pmem driver is also in need of memcpy functions that 
use these non-temporal instructions, both for performance and reliability.
We also need to speed up __clear_page and copy_user_enhanced_string so
userspace accesses through the page cache can keep up.
https://lkml.org/lkml/2015/4/2/453 is one of the threads on that topic.

Some results I've gotten there under different cache attributes
(in terms of 4 KiB IOPS):

16-byte movntdq:
UC write iops=697872 (697.872 K)(0.697872 M)
WB write iops=9745800 (9745.8 K)(9.7458 M)
WC write iops=9801800 (9801.8 K)(9.8018 M)
WT write iops=9812400 (9812.4 K)(9.8124 M)

32-byte vmovntdq:
UC write iops=1274400 (1274.4 K)(1.2744 M)
WB write iops=10259000 (10259 K)(10.259 M)
WC write iops=10286000 (10286 K)(10.286 M)
WT write iops=10294000 (10294 K)(10.294 M)

---
Robert Elliott, HP Server Storage


RE: [Linux-nvdimm] [PATCH v2 08/20] libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)

2015-04-29 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Tuesday, April 28, 2015 1:25 PM
> Subject: [Linux-nvdimm] [PATCH v2 08/20] libnd, nd_acpi: regions (block-
> data-window, persistent memory, volatile memory)
> 
> A "region" device represents the maximum capacity of a BLK range (mmio
> block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
> volatile memory), without regard for aliasing.  Aliasing, in the
> dimm-local address space (DPA), is resolved by metadata on a dimm to
> designate which exclusive interface will access the aliased DPA ranges.
> Support for the per-dimm metadata/label arrives in a subsequent
> patch.
> 
> The name format of "region" devices is "regionN" where, like dimms, N is
> a global ida index assigned at discovery time.  This id is not reliable
> across reboots nor in the presence of hotplug.  Look to attributes of
> the region or static id-data of the sub-namespace to generate a
> persistent name.
...
> +++ b/drivers/block/nd/region_devs.c
...
> +static noinline struct nd_region *nd_region_create(struct nd_bus *nd_bus,
> + struct nd_region_desc *ndr_desc, struct device_type *dev_type)
> +{
> + struct nd_region *nd_region;
> + struct device *dev;
> + u16 i;
> +
> + for (i = 0; i < ndr_desc->num_mappings; i++) {
> + struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
> + struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
> +
> + if ((nd_mapping->start | nd_mapping->size) % SZ_4K) {
> + dev_err(&nd_bus->dev, "%pf: %s mapping%d is not 4K aligned\n",
> + __builtin_return_address(0),

Please use "KiB" rather than the unclear "K".

Same comment for a dev_dbg print in patch 14.

> + dev_name(&nd_dimm->dev), i);
> +
> + return NULL;
> + }
> + }
> +
> + nd_region = kzalloc(sizeof(struct nd_region)
> + + sizeof(struct nd_mapping) * ndr_desc->num_mappings,
> + GFP_KERNEL);
> + if (!nd_region)
> + return NULL;
> + nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL);
> + if (nd_region->id < 0) {
> + kfree(nd_region);
> + return NULL;
> + }
> +
> + memcpy(nd_region->mapping, ndr_desc->nd_mapping,
> + sizeof(struct nd_mapping) * ndr_desc->num_mappings);
> + for (i = 0; i < ndr_desc->num_mappings; i++) {
> + struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i];
> + struct nd_dimm *nd_dimm = nd_mapping->nd_dimm;
> +
> + get_device(&nd_dimm->dev);
> + }
> + nd_region->ndr_mappings = ndr_desc->num_mappings;
> + nd_region->provider_data = ndr_desc->provider_data;
> + dev = &nd_region->dev;
> + dev_set_name(dev, "region%d", nd_region->id);

Could this include "nd" in the name, like "ndregion%d"?

The other dev_set_name calls in this patch set use:
btt%d
ndbus%d
nmem%d
namespace%d.%d

which are a bit more distinctive.

---
Robert Elliott, HP Server Storage


RE: [Linux-nvdimm] [PATCH v2 00/20] libnd: non-volatile memory device support

2015-04-28 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Tuesday, April 28, 2015 1:24 PM
> To: linux-nvd...@lists.01.org
> Cc: Neil Brown; Dave Chinner; H. Peter Anvin; Christoph Hellwig; Rafael J.
> Wysocki; Robert Moore; Ingo Molnar; linux-a...@vger.kernel.org; Jens Axboe;
> Borislav Petkov; Thomas Gleixner; Greg KH; linux-kernel@vger.kernel.org;
> Andy Lutomirski; Andrew Morton; Linus Torvalds
> Subject: [Linux-nvdimm] [PATCH v2 00/20] libnd: non-volatile memory device
> support
> 
> Changes since v1 [1]: Incorporates feedback received prior to April 24.

Here are some comments on the sysfs properties reported for a pmem device.
They are based on v1, but I don't think v2 changes anything.

1. This confuses lsblk (part of util-linux):
/sys/block/pmem0/device/type:4

lsblk shows:
NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 251:0    0   8G  0 worm
pmem1 251:16   0   8G  0 worm
pmem2 251:32   0   8G  0 worm
pmem3 251:48   0   8G  0 worm
pmem4 251:64   0   8G  0 worm
pmem5 251:80   0   8G  0 worm
pmem6 251:96   0   8G  0 worm
pmem7 251:112  0   8G  0 worm

lsblk's blkdev_scsi_type_to_name() considers 4 to mean 
SCSI_TYPE_WORM (write once read many ... used for certain optical
and tape drives).

I'm not sure what nd and pmem are doing to result in that value.

2. To avoid confusing software trying to detect fast storage vs.
slow storage devices via sysfs, this value should be 0:
/sys/block/pmem0/queue/rotational:1

That can be done by adding this shortly after the blk_alloc_queue call:
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);

3. Is there any reason to have a 512 KiB limit on the transfer
length?
/sys/block/pmem0/queue/max_hw_sectors_kb:512

That is from:
   blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);

4. These are read-writeable, but IOs never reach a queue, so 
the queue size is irrelevant and merging never happens:
/sys/block/pmem0/queue/nomerges:0
/sys/block/pmem0/queue/nr_requests:128

Consider making them both read-only with: 
* nomerges set to 2 (no merging happening) 
* nr_requests as small as the block layer allows to avoid 
wasting memory.

5. No scatter-gather lists are created by the driver, so these
read-only fields are meaningless:
/sys/block/pmem0/queue/max_segments:128
/sys/block/pmem0/queue/max_segment_size:65536

Is there a better way to report them as irrelevant?

6. There is no completion processing, so the read-writeable
cpu affinity is not used:
/sys/block/pmem0/queue/rq_affinity:0

Consider making it read-only and set to 2, meaning the
completions always run on the requesting CPU.

7. With mmap() allowing less than logical block sized accesses
to the device, this could be considered misleading:
/sys/block/pmem0/queue/physical_block_size:512

Perhaps that needs to be 1 byte or a cacheline size (64 bytes
on x86) to indicate that direct partial logical block accesses
are possible.  The btt driver could report 512 as one indication
it is different.

I wouldn't be surprised if smaller values than the logical block
size confused some software, though.

---
Robert Elliott, HP Server Storage


RE: [Linux-nvdimm] [PATCH 19/21] nd: infrastructure for btt devices

2015-04-22 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Friday, April 17, 2015 8:37 PM
> To: linux-nvd...@lists.01.org
> Subject: [Linux-nvdimm] [PATCH 19/21] nd: infrastructure for btt devices
> 
...
> +/*
> + * btt_sb_checksum: compute checksum for btt info block
> + *
> + * Returns a fletcher64 checksum of everything in the given info block
> + * except the last field (since that's where the checksum lives).
> + */
> +u64 btt_sb_checksum(struct btt_sb *btt_sb)
> +{
> + u64 sum, sum_save;
> +
> + sum_save = btt_sb->checksum;
> + btt_sb->checksum = 0;
> + sum = nd_fletcher64(btt_sb, sizeof(*btt_sb));
> + btt_sb->checksum = sum_save;
> + return sum;
> +}
> +EXPORT_SYMBOL(btt_sb_checksum);
...

Of all the functions with prototypes in nd.h, this is the only 
function that doesn't have a name starting with nd_.

Following such a convention helps ease setting up ftrace filters.
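For reference, the Fletcher-64 pattern over 32-bit words looks like
this as a standalone sketch (whether nd_fletcher64 matches this exact
formulation is an assumption, since its body is not quoted above):

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-64 over 32-bit words: one plain running sum and one
 * "sum of sums", combined into a single 64-bit value.  The second
 * sum makes the checksum sensitive to word order, unlike a plain
 * additive checksum. */
static uint64_t fletcher64(const void *addr, size_t len)
{
	const uint32_t *buf = addr;
	uint64_t lo = 0, hi = 0;

	for (size_t i = 0; i < len / sizeof(uint32_t); i++) {
		lo += buf[i];   /* simple sum */
		hi += lo;       /* sum of sums: position-sensitive */
	}
	return hi << 32 | lo;
}
```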

---
Robert Elliott, HP Server Storage



RE: [Linux-nvdimm] [PATCH 00/21] ND: NFIT-Defined / NVDIMM Subsystem

2015-04-22 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Dan Williams
> Sent: Friday, April 17, 2015 8:35 PM
> To: linux-nvd...@lists.01.org
> Subject: [Linux-nvdimm] [PATCH 00/21] ND: NFIT-Defined / NVDIMM Subsystem
> 
...
>  create mode 100644 drivers/block/nd/acpi.c
>  create mode 100644 drivers/block/nd/blk.c
>  create mode 100644 drivers/block/nd/bus.c
>  create mode 100644 drivers/block/nd/core.c
...

The kernel already has lots of files with these names:
 5 acpi.c
10 bus.c
66 core.c

I often use ctags like this:
vim -t core.c
but that doesn’t immediately work with common filenames - it 
presents a list of all 66 files to choose from.

Also, blk.c is a name one might expect to see in the block/ 
directory (e.g., next to blk.h).

An nd_ prefix on all the filenames would help.



RE: [tip:x86/pmem] x86/mm: Add support for the non-standard protected e820 type

2015-04-16 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Andy Lutomirski
> Sent: Thursday, April 16, 2015 5:31 PM
> To: Ingo Molnar
> Subject: Re: [tip:x86/pmem] x86/mm: Add support for the non-standard
> protected e820 type
> 
> On Thu, Apr 2, 2015 at 12:51 PM, Andy Lutomirski 
> wrote:
> > On Thu, Apr 2, 2015 at 12:13 PM, Ingo Molnar  wrote:
> >>
> >> * Andy Lutomirski  wrote:
> >>
> >>> On Thu, Apr 2, 2015 at 5:31 AM, tip-bot for Christoph Hellwig
> >>>  wrote:
> >>> > Commit-ID:  ec776ef6bbe1734c29cd6bd05219cd93b2731bd4
> >>> > Gitweb:
> http://git.kernel.org/tip/ec776ef6bbe1734c29cd6bd05219cd93b2731bd4
> >>> > Author: Christoph Hellwig 
> >>> > AuthorDate: Wed, 1 Apr 2015 09:12:18 +0200
> >>> > Committer:  Ingo Molnar 
> >>> > CommitDate: Wed, 1 Apr 2015 17:02:43 +0200
> >>> >
> >>> > x86/mm: Add support for the non-standard protected e820 type
> >>> >
> >>> > Various recent BIOSes support NVDIMMs or ADR using a
> >>> > non-standard e820 memory type, and Intel supplied reference
> >>> > Linux code using this type to various vendors.
> >>> >
> >>> > Wire this e820 table type up to export platform devices for the
> >>> > pmem driver so that we can use it in Linux.
> >>>
> >>> This scares me a bit.  Do we know that the upcoming ACPI 6.0
> >>> enumeration mechanism *won't* use e820 type 12? [...]
> >>
> >> So I know nothing about it, but I'd be surprised if e820 was touched
> >> at all, as e820 isn't really well suited to enumerate more complex
> >> resources, and it appears pmem wants to grow into complex directions?
> >
> > I hope so, but I have no idea what the ACPI committee's schemes are.
> >
> > We could require pmem.enable_legacy_e820=Y to load the driver for now
> > if we're concerned about it.
> >
> 
> ACPI 6.0 is out:
> 
> http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> 
> AFAICT from a quick read, ACPI 6.0 systems will show EFI type 14
> (EfiPersistentMemory), ACPI type 7 (AddressRangePersistentMemory) and
> e820 type 7.
> 
> Type 12 is still "OEM defined".  See table 15-312.  Maybe I'm reading
> this wrong.
> 
> *However*, ACPI 6.0 unsurprisingly also has a real enumeration
> mechanism for NVDIMMs and such, and those should take precedence.
> 
> So this driver could plausibly be safe even on ACPI 6.0 systems.
> Someone from one of the relevant vendors should probably confirm that.
> I'm still a bit nervous, though.

That value was set aside on behalf of this pre-standard usage, to
keep future ACPI revisions from standardizing it for anything else.
The kernel should be just as safe in continuing to recognize that
value as it is now.

New legacy BIOS systems should follow ACPI 6.0 going forward and
report type 7 in their E820 tables, along with meeting the other
ACPI 6.0 requirements.

UEFI systems don't provide an E820 table; that's an artificial
creation by the bootloader (e.g., grub2) or the kernel with its
add_efi_memmap parameter. These systems will just report the
EFI memory type 14. There was no pre-standard EFI type
identified that needed to be blocked out.

Any software creating a fake E820 table (grub2, etc.) should 
map EFI type 14 to E820 type 7.
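In code, that mapping is a one-liner.  The type values below come
from the UEFI and ACPI 6.0 specs; the function name and the
fall-through policy are illustrative, not grub2's actual code
(real tables must map every EFI type, only the pmem case is shown):

```c
/* UEFI spec: memory type 14 is EfiPersistentMemory.
 * ACPI 6.0: E820 type 7 is AddressRangePersistentMemory. */
enum { EFI_PERSISTENT_MEMORY = 14 };
enum { E820_TYPE_PMEM = 7 };

/* Sketch of the translation a bootloader building a fake E820
 * table would apply to a persistent-memory descriptor. */
static int efi_to_e820(int efi_type)
{
	switch (efi_type) {
	case EFI_PERSISTENT_MEMORY:
		return E820_TYPE_PMEM;
	default:
		return -1;  /* other EFI types handled elsewhere */
	}
}
```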

---
Robert Elliott, HP Server Storage


RE: [Linux-nvdimm] [GIT PULL] PMEM driver for v4.1

2015-04-14 Thread Elliott, Robert (Server Storage)

> -Original Message-
> From: Dan Williams [mailto:dan.j.willi...@intel.com]
> Sent: Tuesday, April 14, 2015 11:34 AM
> To: Elliott, Robert (Server Storage)
> > ...
> >> Since it's directly mapped it should just work for most things if it's
> >> at least write-through cached (UC would be a horror), and it would
> >> also solve all the size problems. With write-through caching it should
> >> also be pretty OK performance-wise. The 64 bytes size is ideal as
> >
> > Are the WT support patches going to make it into 4.1?
> 
> Which patches are these?  Maybe I missed them, but I don't see
> anything in the archives.

These have been baking in linux-mm and linux-next:
* [PATCH v3 0/6] Kernel huge I/O mapping support
  https://lkml.org/lkml/2015/3/3/589
* [PATCH v4 0/7] mtrr, mm, x86: Enhance MTRR checks for huge I/O mapping
  https://lkml.org/lkml/2015/3/24/1056

I don't think this made it into a subsystem tree yet:
* [PATCH v8 0/7] Support Write-Through mapping on x86 
  https://lkml.org/lkml/2015/2/24/773

I guess we could target 4.2 for both the WT series and
pmem patches that support the new ioremap_wt() function.

---
Robert Elliott, HP Server Storage


RE: [Linux-nvdimm] [GIT PULL] PMEM driver for v4.1

2015-04-14 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf
> Of Ingo Molnar
> Sent: Tuesday, April 14, 2015 7:42 AM
> To: Christoph Hellwig
> Subject: Re: [Linux-nvdimm] [GIT PULL] PMEM driver for v4.1
> 
> 
...
> Since it's directly mapped it should just work for most things if it's
> at least write-through cached (UC would be a horror), and it would
> also solve all the size problems. With write-through caching it should
> also be pretty OK performance-wise. The 64 bytes size is ideal as

Are the WT support patches going to make it into 4.1?




pmem and i_dio_count overhead

2015-04-03 Thread Elliott, Robert (Server Storage)
Jens, one of your patches from October 2013 never made it
into the kernel, but would be beneficial for pmem.  It improves
IOPS by about 15%.

Original patch: https://lkml.org/lkml/2013/10/24/130 

> From Jens Axboe 
> Subject [PATCH 05/11] direct-io: only inc/dec inode->i_dio_count for file 
> systems 
> Date Thu, 24 Oct 2013 10:25:58 +0100 
>
> We don't need truncate protection for block devices, so add a flag
> bypassing this cache line dirtying twice for every IO. This easily
> contributes to 5-10% of the CPU time on high IOPS O_DIRECT testing.

Here are perf top results while running fio to pmem devices 
using memcpy with non-temporal load and store instructions:

 20.54%  [pmem]   [k] pmem_do_bvec.isra.6   
 10.13%  [kernel] [k] do_blockdev_direct_IO
  5.93%  [kernel] [k] inode_dio_done
  4.46%  [kernel] [k] bio_endio
  3.07%  fio  [.] get_io_u
  2.08%  fio  [.] do_io

Inside do_blockdev_direct_io (10%), 60% of the time is spent
atomically incrementing i_dio_count:

   │  static inline void atomic_inc(atomic_t *v)
   │  {
   │  asm volatile(LOCK_PREFIX "incl %0"
  0.06 │ 225:   lock   incl   0x134(%r14)
   │  atomic_inc(&inode->i_dio_count);
   │
   │  retval = 0;
   │  sdio.blkbits = blkbits;
   │  sdio.blkfactor = i_blkbits - blkbits;
   │  sdio.block_in_file = offset >> blkbits;
 60.31 │mov-0x1d0(%rbp),%rdx
  0.16 │mov%r12d,%ecx
   │   */
   │  atomic_inc(&inode->i_dio_count);
   │
   │  retval = 0;
   │  sdio.blkbits = blkbits;
   │  sdio.blkfactor = i_blkbits - blkbits;
  0.00 │sub%r12d,%ebx
   │   * Will be decremented at I/O completion time.
   │   */
   │  atomic_inc(&inode->i_dio_count);

inode_dio_done is taking all of its 5.8% time doing the 
corresponding atomic_dec.

So, they're combining for 11.8% of the overall CPU time.
The problem is more atomic contention than cache line dirtying.

Applying your patch (changing the bitmask from 0x04 to
0x08, since 0x04 is taken now) eliminates those 
instructions from perf top and improves the high IOPS 
results by 5 to 15%.

Attr  Copy            Read IOPS   Write IOPS
====  ====            =========   ==========
UC    NT rd,wr        513 K       326 K
      with the patch  510 K       325 K

WB    NT rd,wr        3.3 M       3.5 M
      with the patch  3.8 M       3.9 M

WC    NT rd,wr        3.0 M       3.9 M
      with the patch  3.1 M       4.1 M

WT    NT rd,wr        3.3 M       2.1 M
      with the patch  3.7 M       3.7 M

(there is some other test environment inconsistency
with WT writes - I don't think this change really
helped by 76%)

---
Robert Elliott, HP Server Storage




RE: another pmem variant V2

2015-04-02 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, April 1, 2015 2:26 AM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; linux-nvd...@ml01.01.org; linux-
> fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; x...@kernel.org;
> ross.zwis...@linux.intel.com; ax...@kernel.dk; b...@plexistor.com; Kani,
> Toshimitsu
> Subject: Re: another pmem variant V2
> 
> On Tue, Mar 31, 2015 at 10:11:29PM +, Elliott, Robert (Server Storage)
> wrote:
> > I used fio to test 4 KiB random read and write IOPS
> > on a 2-socket x86 DDR4 system.  With various cache attributes:
> >
> > attr    read    write   notes
> > ----    ----    -----   -----
> > UC      37 K    21 K    ioremap_nocache
> > WB      3.6 M   2.5 M   ioremap
> > WC      764 K   3.7 M   ioremap_wc
> > WT                      ioremap_wt
> >
> > So, although UC and WT are the only modes certain to be safe,
> > the V1 default of UC provides abysmal performance - worse than
> > a consumer-class SATA SSD.
> 
> It doesn't look quite as bad on my setup, but performance is fairly
> bad here as well.
> 
> > A solution for x86 is to use the MOVNTI instruction in WB
> > mode. This non-temporal hint uses a buffer like the write
> > combining buffer, not filling the cache and not stopping
> > everything in the CPU.  The kernel function __copy_from_user()
> > uses that instruction (with SFENCE at the end) - see
> > arch/x86/lib/copy_user_nocache_64.S.
> >
> > If I made the change from memcpy() to __copy_from_user()
> > correctly, that results in:
> >
> > attr        read    write   notes
> > ----        ----    -----   -----
> > WB w/NTI    2.4 M   2.6 M   __copy_from_user()
> > WC w/NTI    3.2 M   2.1 M   __copy_from_user()
> 
> That looks a lot better.  It doesn't help us with a pmem device
> mapped directly into userspace using mmap with the DAX infrastructure,
> though.
> 
> Note when we want to move to non-temporal copies we'll need to add
> a new prototype, as __copy_from_user isn't guaranteed to use these,
> and it is defined to only work on user addresses.  That doesn't matter
> on x86 but would blow up on say sparc or s390.

Here are some updated numbers including:
* WT (writethrough) cache attribute
* memcpy that uses non-temporal stores (MOVNTDQ) to the 
  persistent memory for block writes (rather than MOVNTI)
* memcpy that uses non-temporal loads (MOVNTDQA) from the 
  persistent memory for block reads

Attr  Copy        Read IOPS   Write IOPS
====  ====        =========   ==========
UC    memcpy      36 K        22 K
UC    NT rd,wr    513 K       326 K

WB    memcpy      3.4 M       2.5 M
WB    NT rd,wr    3.3 M       3.5 M

WC    memcpy      776 K       3.5 M
WC    NT rd,wr    3.0 M       3.9 M

WT    memcpy      2.1 M       22 K
WT    NT rd,wr    3.3 M       2.1 M

A few other variations yielded the peak numbers:
WC    NT rd only  3.2 M       4.1 M
WC    NT wr only  712 K       4.6 M
WT    NT wr only  2.6 M       4.0 M

There are lots of tuning considerations for those memcpy
functions - how far to unroll the loop, whether to
include PREFETCHNTA instructions, etc.



RE: another pmem variant V2

2015-04-01 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 3:33 AM
> To: linux-nvd...@ml01.01.org; linux-fsde...@vger.kernel.org; linux-
> ker...@vger.kernel.org; x...@kernel.org
> Cc: ross.zwis...@linux.intel.com; ax...@kernel.dk; b...@plexistor.com
> Subject: another pmem variant V2
> 

I triggered a paging error in the memcpy call for a block read
from systemd-udevd (actually in a modified memcpy() for the cache
attribute experiments).

1. This triggered an illegal schedule() call from an atomic context.
The call trace is shown below.

2. memcpy() doesn't provide exception handling or error reporting.
Some functions do, like __copy_user_nocache in
arch/x86/lib/copy_user_nocache_64.S.

Should pmem use only functions that provide this, when available
on the architecture?

pmem_rw_page can pass along the return value from the copy function.
pmem_make_request can report the error, if any, via bio_endio.


Call trace
==
[62117.317216] BUG: scheduling while atomic: systemd-udevd/22135/0x0001
[62117.317232] Modules linked in: pmem ip6table_filter ip6_tables 
iptable_filter ip_tables ebtable_nat ebtables sg vfat fat x86_pkg_temp_thermal 
coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 
glue_helper lrw gf128mul ablk_helper cryptd xhci_pci hpilo xhci_hcd sb_edac 
edac_core microcode iTCO_wdt iTCO_vendor_support hpwdt ioatdma shpchp pcspkr 
lpc_ich mfd_core i2c_i801 wmi pcc_cpufreq dca acpi_cpufreq uinput nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc xfs exportfs sr_mod cdrom sd_mod bnx2x 
tg3 ahci mdio libahci ptp pps_core hpsa libcrc32c dm_mirror dm_region_hash 
dm_log dm_mod ipv6 autofs4 [last unloaded: pmem]
[62117.317233] CPU: 31 PID: 22135 Comm: systemd-udevd Tainted: G  D 
4.0.0-rc6+ #7
[62117.317234] Hardware name: HP ProLiant DL380 Gen9
[62117.317235]  88047f3f3ac0 8804241db2e8 815a8866 
ff86ff86
[62117.317236]  8804241dbfd8 8804241db2f8 815a4b45 
8804241db348
[62117.317237]  815ab893 880457091050 88047f3fbb20 

[62117.317237] Call Trace:
[62117.317240]  [] dump_stack+0x45/0x57
[62117.317245]  [] __schedule_bug+0x46/0x54
[62117.317247]  [] __schedule+0x793/0x870
[62117.317251]  [] ? bit_wait+0x50/0x50
[62117.317252]  [] schedule+0x37/0x90
[62117.317253]  [] schedule_timeout+0x1dc/0x260
[62117.317258]  [] ? ktime_get+0x3e/0xa0
[62117.317259]  [] io_schedule_timeout+0xac/0x140
[62117.317261]  [] bit_wait_io+0x36/0x50
[62117.317262]  [] __wait_on_bit_lock+0x4b/0xb0
[62117.317263]  [] ? find_get_entries+0xe2/0x130
[62117.317265]  [] __lock_page+0xac/0xb0
[62117.317269]  [] ? autoremove_wake_function+0x40/0x40
[62117.317276]  [] truncate_inode_pages_range+0x3af/0x620
[62117.317278]  [] ? cpumask_next_and+0x37/0x50
[62117.317279]  [] ? __brelse+0x40/0x40
[62117.317283]  [] ? smp_call_function_many+0x5d/0x280
[62117.317284]  [] ? free_cpumask_var+0x9/0x10
[62117.317285]  [] ? on_each_cpu_cond+0xbd/0x160
[62117.317286]  [] ? __brelse+0x40/0x40
[62117.317288]  [] truncate_inode_pages+0x15/0x20
[62117.317289]  [] kill_bdev+0x33/0x40
[62117.317291]  [] __blkdev_put+0x68/0x210
[62117.317293]  [] blkdev_put+0x50/0x130
[62117.317294]  [] blkdev_close+0x25/0x30
[62117.317296]  [] __fput+0xe7/0x220
[62117.317298]  [] fput+0xe/0x10
[62117.317302]  [] task_work_run+0xc4/0xe0
[62117.317306]  [] do_exit+0x2d8/0xb10
[62117.317308]  [] ? kmsg_dump+0x9c/0xc0
[62117.317312]  [] oops_end+0x8e/0xd0
[62117.317313]  [] no_context+0x2d4/0x334
[62117.317314]  [] __bad_area_nosemaphore+0x6d/0x1c6
[62117.317317]  [] ? zone_statistics+0x80/0xa0
[62117.317319]  [] bad_area_nosemaphore+0x13/0x15
[62117.317321]  [] __do_page_fault+0x91/0x430
[62117.317322]  [] do_page_fault+0xc/0x10
[62117.317324]  [] page_fault+0x22/0x30
[62117.317325]  [] ? pmem_do_bvec.isra.6+0x212/0x3f0 [pmem]
[62117.317326]  [] pmem_rw_page+0x43/0x60 [pmem]
[62117.317328]  [] ? __radix_tree_preload+0x38/0xa0
[62117.317329]  [] bdev_read_page+0x2e/0x40
[62117.317330]  [] do_mpage_readpage+0x51f/0x6c0
[62117.317331]  [] ? lru_cache_add+0xe/0x10
[62117.317332]  [] mpage_readpages+0xdb/0x130
[62117.317333]  [] ? I_BDEV+0x10/0x10
[62117.317334]  [] ? I_BDEV+0x10/0x10
[62117.317336]  [] blkdev_readpages+0x1d/0x20
[62117.317336]  [] __do_page_cache_readahead+0x194/0x210
[62117.317337]  [] force_page_cache_readahead+0x75/0xb0
[62117.317338]  [] page_cache_sync_readahead+0x43/0x50
[62117.317339]  [] generic_file_read_iter+0x431/0x630
[62117.317341]  [] blkdev_read_iter+0x37/0x40
[62117.317342]  [] new_sync_read+0x7e/0xb0
[62117.317343]  [] __vfs_read+0x18/0x50
[62117.317344]  [] vfs_read+0x86/0x140
[62117.317345]  [] SyS_read+0x46/0xb0
[62117.317346]  [] ? __audit_syscall_entry+0xb4/0x110
[62117.317348]  [] system_call_fastpath+0x12/0x17
[62121.618505] note: systemd-udevd[22133] exited with preempt_count 1



RE: another pmem variant V2

2015-03-31 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 3:33 AM
> To: linux-nvd...@ml01.01.org; linux-fsde...@vger.kernel.org; linux-
> ker...@vger.kernel.org; x...@kernel.org
> Cc: ross.zwis...@linux.intel.com; ax...@kernel.dk; b...@plexistor.com
> Subject: another pmem variant V2
> 
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.  The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistent memory region instead of hardcoding it in the
> Kconfig.  This allows pmem.c to stay separate from any discovery mechanism,
> but still allows auto-discovery.
> 
...
> This has been tested both with a real NVDIMM on a system with a type 12
> capable bios, as well as with "fake persistent" memory using the memmap=
> option.
> 
> Changes since V1:
>   - s/E820_PROTECTED_KERN/E820_PMEM/g
>   - map the persistent memory as uncached
>   - better kernel parameter description
>   - various typo fixes
>   - MODULE_LICENSE fix

I used fio to test 4 KiB random read and write IOPS 
on a 2-socket x86 DDR4 system.  With various cache attributes:

attr    read    write   notes
----    ----    -----   -----
UC      37 K    21 K    ioremap_nocache
WB      3.6 M   2.5 M   ioremap
WC      764 K   3.7 M   ioremap_wc
WT                      ioremap_wt

So, although UC and WT are the only modes certain to be safe,
the V1 default of UC provides abysmal performance - worse than
a consumer-class SATA SSD.

A solution for x86 is to use the MOVNTI instruction in WB
mode. This non-temporal hint uses a buffer like the write
combining buffer, not filling the cache and not stopping
everything in the CPU.  The kernel function __copy_from_user() 
uses that instruction (with SFENCE at the end) - see
arch/x86/lib/copy_user_nocache_64.S.

If I made the change from memcpy() to __copy_from_user()
correctly, that results in:

attr        read    write   notes
----        ----    -----   -----
WB w/NTI    2.4 M   2.6 M   __copy_from_user()
WC w/NTI    3.2 M   2.1 M   __copy_from_user()

There is also a non-temporal streaming load hint instruction
called MOVNTDQA that might be helpful for reads for both WB
and WC. I don't see any existing kernel memcpy-like function 
that utilizes this instruction, so haven't tried it yet.


Intel 64 and IA-32 Architectures
Software Developer's Manual excerpts (Jan 2015)
===
"The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ,
MOVNTPS, and MOVNTPD) allow data to be moved from the
processor's registers directly into system memory without
being also written into the L1, L2, and/or L3 caches. These
instructions can be used to prevent cache pollution when
operating on data that is going to be modified only once
before being stored back into system memory. ...

MOVNTI
...
The non-temporal hint is implemented by using a write
combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor
does not write the data into the cache hierarchy,
nor does it fetch the corresponding cache line from
memory into the cache hierarchy.
...

MOVNTDQA Provides a non-temporal hint that can cause
adjacent 16-byte items within an aligned 64-byte region
(a streaming line) to be fetched and held in a small
set of temporary buffers ("streaming load buffers"). 
Subsequent streaming loads to other aligned 16-byte 
items in the same streaming line may be supplied from
the streaming load buffer and can improve throughput.
...
A processor implementation may make use of the 
non-temporal hint associated with this instruction if
the memory source is WC (write combining) memory type. 
An implementation may also make use of the non-temporal
hint associated with this instruction if the memory
source is WB (writeback) memory type."




RE: [PATCH 2/3] x86: add a is_e820_ram() helper

2015-03-26 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 11:43 AM
> To: Boaz Harrosh
> Cc: Christoph Hellwig; Ingo Molnar; linux-nvd...@ml01.01.org; linux-
> fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; x...@kernel.org;
> ross.zwis...@linux.intel.com; ax...@kernel.dk
> Subject: Re: [PATCH 2/3] x86: add a is_e820_ram() helper
> 
> On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
> > > + memmap=nn[KMG]!ss[KMG]
> > > + [KNL,X86] Mark specific memory as protected.
> > > + Region of memory to be used, from ss to ss+nn.
> > > + The memory region may be marked as e820 type 12 (0xc)
> > > + and is NVDIMM or ADR memory.
> > > +
> >
> > Do we need to escape "\!" this character on grub command line ? It might
> > help to note that. I did like the original "|" BTW
> 
> No need to escape it on the kvm command line, which is where I tested
> this flag only so far.  If there is a strong argument for "|" I'm happy
> to change it.

I agree with Boaz that ! is a nuisance if loading pmem as a module
with modprobe from bash. 

The ! does work fine in the grub2 command line if the driver is 
built-in.  I added it to /etc/default/grub like this:
GRUB_CMDLINE_LINUX=" memmap=8G!18G"
and grub-mkconfig placed it in /boot/efi/EFI/centos/grub.cfg
with no issues.



RE: [PATCH 1/8] pmem: Initial version of persistent memory driver

2015-03-25 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Andy Lutomirski
> Sent: Wednesday, March 18, 2015 1:07 PM
> To: Boaz Harrosh
> Cc: Matthew Wilcox; Ross Zwisler; X86 ML; Thomas Gleixner; Dan Williams;
> Ingo Molnar; Roger C. Pao; linux-nvdimm; linux-kernel; H. Peter Anvin;
> Christoph Hellwig
> Subject: Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
> 
> On Mar 9, 2015 8:20 AM, "Boaz Harrosh"  wrote:
> >
> > On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
> > <>
> > >
> > > I think it would be nice to have control over the caching mode.
> > > Depending on the application, WT or UC could make more sense.
> > >
> >
> > Patches are welcome. say
> > map=sss@aaa:WT,sss@aaa:CA, ...
> >
> > But for us, with direct_access(), all benchmarks show a slight advantage
> > for the cached mode.
> 
> I'm sure cached is faster.  The question is: who flushes the cache?
> 
> --Andy

Nobody.

Therefore, pmem as currently proposed (mapping the memory with
ioremap_cache, which uses _PAGE_CACHE_MODE_WB) is unsafe unless the
system is doing something special to ensure L1, L2, and L3 caches are
flushed on power loss.

I think pmem needs to map the memory as UC or WT by default, providing
WB and WC only as an option for users confident that those attributes
are safe to use in their system.

Even using UC or WT presumes that ADR is in place.





RE: [PATCH 3/3] x86: add support for the non-standard protected e820 type

2015-03-25 Thread Elliott, Robert (Server Storage)

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Wednesday, March 25, 2015 11:04 AM
> Subject: [PATCH 3/3] x86: add support for the non-standard protected e820
> type
> 
> Various recent bioses support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.

If this goes into the kernel, I think someone should request that the
ACPI specification mark the value 12 as permanently tainted.  Otherwise
they could assign it to some new meaning that conflicts with all
of this.

A few editorial nits follow...

> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-
> parameters.txt
> index bfcb1a6..98eeaca 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be
> entirely omitted.
>or
>memmap=0x1$0x1869
> 
> + memmap=nn[KMG]!ss[KMG]
> + [KNL,X86] Mark specific memory as protected.
> + Region of memory to be used, from ss to ss+nn.
> + The memory region may be marked as type 12 and
> + is NVDIMM or ADR memory.

It can be confusing that E820h type values differ from UEFI 
memory map type values, so it might be worth emphasizing that is 
an E820h type value.

Showing hex alongside would also clarify that it is indeed a 
decimal 12.

Suggestion: "e820 type 12 (0xc)"

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b7d31ca..93a27e4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE
> 
>  source "mm/Kconfig"
> 
> +config X86_PMEM_LEGACY
> + bool "Support non-stanard NVDIMMs and ADR protected memory"

stanard s/b standard


> + help
> +   Treat memory marked using the non-stard e820 type of 12 as used

stard s/b standard

> +   by the Intel Sandy Bridge-EP reference BIOS as protected memory.
> +   The kernel will the offer these regions to the pmem driver so

the s/b then

> +   they can be used for persistent storage.
> +
> +   If you say N the kernel will treat the ADR region like an e820
> +   reserved region.
> +
> +   Say Y if unsure
> +

...
> diff --git a/arch/x86/include/uapi/asm/e820.h
> b/arch/x86/include/uapi/asm/e820.h
> index d993e33..e040950 100644
> --- a/arch/x86/include/uapi/asm/e820.h
> +++ b/arch/x86/include/uapi/asm/e820.h
> @@ -33,6 +33,16 @@
>  #define E820_NVS 4
>  #define E820_UNUSABLE5
> 
> +/*
> + * This is a non-standardized way to represent ADR or NVDIMM regions that
> + * persist over a reboot.  The kernel will ignore their special
> capabilities
> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
> + *
> + * Note that older platforms also used 6 for the same type of memory,
> + * but newer versions switched to 12 as 6 was assigned differently.  Some
> + * time they will learn..
> + */
> +#define E820_PROTECTED_KERN  12

The p in pmem means persistent, not protected.  To me, protected sounds 
like a security feature.  I suggest using a different macro name and 
text strings.

...
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index de088e3..8c6a976 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -48,10 +48,22 @@ unsigned long pci_mem_start = 0xaeedbabe;
>  EXPORT_SYMBOL(pci_mem_start);
>  #endif
> 
> +/*
> + * Memory protected by the system ADR (asynchronous dram refresh)
> + * mechanism is accounted as ram for purposes of establishing max_pfn
> + * and mem_map.
> + */

ADR doesn't really protect or ensure persistence; it just puts memory
into self-refresh mode. Batteries/capacitors and other logic are what
provide the persistence.

...
> @@ -154,6 +166,9 @@ static void __init e820_print_type(u32 type)
>   case E820_UNUSABLE:
>   printk(KERN_CONT "unusable");
>   break;
> + case E820_PROTECTED_KERN:
> + printk(KERN_CONT "protected (type %u)\n", type);
> + break;

Same "protect" comment applies there, and a few other places in
the patch not excerpted.




RE: [PATCH] [RESEND] aic7xxx: replace kmalloc/memset by kzalloc

2015-03-24 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Joe Perches [mailto:j...@perches.com]
> Sent: Tuesday, March 24, 2015 3:57 PM
> To: Michael Opdenacker
> Cc: Hannes Reinecke; jbottom...@parallels.com; Elliott, Robert (Server
> Storage); linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] [RESEND] aic7xxx: replace kmalloc/memset by kzalloc
> 
> On Tue, 2015-03-24 at 13:46 -0700, Michael Opdenacker wrote:
...
> > On 03/22/2015 11:59 PM, Hannes Reinecke wrote:
> > > On 03/22/2015 05:31 PM, Michael Opdenacker wrote:
> > >> This replaces kmalloc + memset by a call to kzalloc
> > >> (or kcalloc when appropriate, which zeroes memory too)
> > >>
...
> > I'm sending a version that reverts the use of kcalloc() instead of
> > kzalloc(). For reasons I don't understand, I didn't see the end of
> > Robert Elliott's comment that the use of kcalloc() could prevent the
> > compiler from detecting an overflow.
> 
> I'm confused.  I don't see that comment either, but
> the entire point of kcalloc is to prevent overflows
> by returning NULL when an overflow might occur.

It was a reply to the original post on 2014-10-16, not the resend
this month. 

From http://permalink.gmane.org/gmane.linux.kernel/1808168:

kcalloc is helpful when one of the values is a variable that 
might cause the multiply to overflow during runtime.  Here, 
two constants are being multiplied together, which can
be done and checked by the compiler at compile time.  

Since kcalloc and kmalloc_array are both static inline 
functions:
static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
{
        if (size != 0 && n > SIZE_MAX / size)
                return NULL;
        return __kmalloc(n * size, flags);
}

static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
{
        return kmalloc_array(n, size, flags | __GFP_ZERO);
}

a compiler that detects an overflow will probably just reduce
that to an inlined "return NULL."

BUILD_BUG_ON could be used to trigger a compile-time error,
instead of building a kernel that returns a run-time error.





RE: [RESEND] [PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-01-29 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Arnd Bergmann
> Sent: Thursday, 29 January, 2015 4:03 AM
> To: Darrick J. Wong
> Cc: Jens Axboe; Christoph Hellwig; linux-kernel@vger.kernel.org; linux-
> fsde...@vger.kernel.org; linux-...@vger.kernel.org; Jeff Layton; J. Bruce
> Fields
> Subject: Re: [RESEND] [PATCH] block: create ioctl to discard-or-zeroout a
> range of blocks
> 
> On Wednesday 28 January 2015 18:00:25 Darrick J. Wong wrote:
> > Create a new ioctl to expose the block layer's newfound ability to
> > issue either a zeroing discard, a WRITE SAME with a zero page, or a
> > regular write with the zero page.  This BLKZEROOUT2 ioctl takes
> > {start, length, flags} as parameters.  So far, the only flag available
> > is to enable the zeroing discard part -- without it, the call invokes
> > the old BLKZEROOUT behavior.  start and length have the same meaning
> > as in BLKZEROOUT.
> >
> > Furthermore, because BLKZEROOUT2 issues commands directly to the
> > storage device, we must invalidate the page cache (as a regular
> > O_DIRECT write would do) to avoid returning stale cache contents at a
> > later time.
> >
> > This patch depends on "block: Add discard flag to
> > blkdev_issue_zeroout() function" in Jens' for-3.20/core branch.
> >
> > Signed-off-by: Darrick J. Wong 
> >
> 
> Would this work ok for devices that fill discarded areas with all-ones
> instead of all-zeroes? I believe SD cards can do either.
> 
>   Arnd

RAID software stacks need predictable data, and 0 is a lot easier to
deal with than 0xff in Galois field (XOR-based) math.

The SCSI and ATA standards both define bits to report if the device is
designed to return 0 after discard; they don't have a way to report
that any other particular non-zero value is returned.

---
Rob Elliott    HP Server Storage




RE: [PATCH v3 1/2] x86: Add support for the pcommit instruction

2015-01-28 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Ross Zwisler
> Sent: Tuesday, 27 January, 2015 10:54 AM
> To: linux-kernel@vger.kernel.org
> Cc: Ross Zwisler; H Peter Anvin; Ingo Molnar; Thomas Gleixner; Borislav
> Petkov
> Subject: [PATCH v3 1/2] x86: Add support for the pcommit instruction
> 
> Add support for the new pcommit (persistent commit) instruction.  This
> instruction was announced in the document "Intel Architecture
> Instruction Set Extensions Programming Reference" with reference number
> 319433-022.
> 
> https://software.intel.com/sites/default/files/managed/0d/53/319433-
> 022.pdf
> 
...
> ---
>  arch/x86/include/asm/cpufeature.h| 1 +
>  arch/x86/include/asm/special_insns.h | 8 
>  2 files changed, 9 insertions(+)

Should this patch series also add defines for the virtual 
machine control data structure changes?

1. Add the new VM-Execution Controls bit 21 as
SECONDARY_EXEC_PCOMMIT_EXITING 0x00200000
to arch/x86/include/asm/vmx.h.

2. Add the new exit reason of 64 (0x41) as
EXIT_REASON_PCOMMIT  64
to arch/x86/include/uapi/asm/vmx.h and (with a
VMX_EXIT_REASONS string) to usr/include/asm/vmx.h.

3. Add a kvm_vmx_exit_handler to arch/x86/kvm/vmx.c.


---
Rob Elliott    HP Server Storage




RE: [PATCH 09/22] [SCSI] mpt2sas, mpt3sas: Added a support to set cpu affinity for each MSIX vector enabled by the HBA

2014-12-10 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Sreekanth Reddy
> Sent: Tuesday, 09 December, 2014 6:17 AM
> To: martin.peter...@oracle.com; j...@kernel.org; h...@infradead.org
...
> Change_set:
> 1. Added an affinity_hint variable of type cpumask_var_t in the
>    adapter_reply_queue structure, and allocated memory for this variable
>    by calling zalloc_cpumask_var.
> 2. Call the API irq_set_affinity_hint for each MSI-X vector to affinitize it
>    with the calculated CPUs at driver initialization time.
> 3. While freeing the MSI-X vector, call this same API to release the cpu
>    affinity mask for each MSI-X vector by providing the NULL value in the
>    cpumask argument.
> 4. Then call the free_cpumask_var API to free the memory allocated in step 2.
> 
...

> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c
> b/drivers/scsi/mpt3sas/mpt3sas_base.c
> index 1560115..f0f8ba0 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
...
> @@ -1609,6 +1611,10 @@ _base_request_irq(struct MPT3SAS_ADAPTER *ioc, u8
> index, u32 vector)
>   reply_q->ioc = ioc;
>   reply_q->msix_index = index;
>   reply_q->vector = vector;
> +
> + if (!zalloc_cpumask_var(&reply_q->affinity_hint, GFP_KERNEL))
> + return -ENOMEM;

I think this will create the problem Alex Thorlton just reported
with lpfc on a system with a huge number (6144) of CPUs.

See this thread:
[BUG] kzalloc overflow in lpfc driver on 6k core system

---
Rob Elliott    HP Server Storage





RE: [BUG] kzalloc overflow in lpfc driver on 6k core system

2014-12-02 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Alex Thorlton
> Sent: Tuesday, 02 December, 2014 3:58 PM
...
> We've recently upgraded our big machine up to 6144 cores, and we're
> shaking out a number of bugs related to booting at that large core
> count.  Last night I tripped a warning from the lpfc driver that appears
> to be related to a kzalloc that uses the number of cores as part of its
> size calculation.  Here's the backtrace from the warning:
...
> For a little bit more information on exactly what's going wrong, we're
> tripping the warning from lpfc_pci_probe_one_s4 (as you can see from the
> trace).  That function calls down to lpfc_sli4_driver_resource_setup,
> which contains the failing kzalloc here:
> 
> phba->sli4_hba.cpu_map = kzalloc((sizeof(struct lpfc_vector_map_info) *
>  phba->sli4_hba.num_present_cpu),
>  GFP_KERNEL);
> 
> As mentioned, it looks like we're multiplying the number available cpus
> by that struct size to get an allocation size, which ends up being
> greater than KMALLOC_MAX_SIZE.
> 
> Does anyone have any ideas on what could be done to break that
> allocation up into smaller pieces, or to make it in a different way so
> that we avoid this warning?
> 
> Any help is greatly appreciated.  Thanks!
> 

That structure includes an NR_CPUS-sized maskbits field, which is
probably what makes the allocation too big.

include/linux/cpumask.h:
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

drivers/scsi/lpfc/lpfc_sli4.h:
struct lpfc_vector_map_info {
	uint16_t	phys_id;
	uint16_t	core_id;
	uint16_t	irq;
	uint16_t	channel_id;
	struct cpumask	maskbits;
};

maskbits appears to only be used for setting IRQ affinity hints in
drivers/scsi/lpfc/lpfc_init.c:
	for (idx = 0; idx < vectors; idx++) {
		cpup = phba->sli4_hba.cpu_map;
		cpu = lpfc_find_next_cpu(phba, phys_id);
		...
		mask = &cpup->maskbits;
		cpumask_clear(mask);
		cpumask_set_cpu(cpu, mask);
		i = irq_set_affinity_hint(phba->sli4_hba.msix_entries[idx].
					  vector, mask);

In similar code, mpt3sas and lockless hpsa just call get_cpu_mask()
inside the loop:
	cpu = cpumask_first(cpu_online_mask);
	for (i = 0; i < h->msix_vector; i++) {
		rc = irq_set_affinity_hint(h->intr[i], get_cpu_mask(cpu));
		cpu = cpumask_next(cpu, cpu_online_mask);
	}

get_cpu_mask() uses the global cpu_bit_bitmap array, which is defined
in kernel/cpu.c and declared in include/linux/cpumask.h:
extern const unsigned long
cpu_bit_bitmap[BITS_PER_LONG+1][BITS_TO_LONGS(NR_CPUS)];

That approach should work for lpfc.

---
Rob Elliott    HP Server Storage





RE: [PATCH RESEND v8] iopoll: Introduce memory-mapped IO polling macros

2014-11-24 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Mitchel Humpherys [mailto:mitch...@codeaurora.org]
> Sent: Monday, November 24, 2014 7:21 PM
> To: Elliott, Robert (Server Storage)
...
> > The hpsa SCSI driver used to use usleep_range in a loop like
> > that, but we found that it caused scheduler problems during
> > boots because it uses TASK_UNINTERRUPTIBLE:
> > [9.260668] [sched_delayed] sched: RT throttling activated
> >
> > msleep() worked much better.

Sorry, that might have been msleep_interruptible() - the fixes
are still not upstream, so I'll have to double-check tomorrow.

> Hmm, maybe you were just sleeping for too long?  According to
> Documentation/timers/timers-howto.txt, usleep_range is what should be
> used for non-atomic sleeps in the range [10us, 20ms].  Plus we need
> microsecond granularity anyways, so msleep wouldn't cut it.

The intervals and the overall number of loops were indeed long.  I 
think the corrected code waits a total of 1 second; before the fix,
600 seconds.

> If there are any potential users of these macros that would want to
> sleep for more than 20ms I guess we could add a special case here to use
> msleep when sleep_us exceeds 20,000 or so.
...

Maybe just a comment in the documentation for the macro would suffice,
possibly with a separate macro for longer interval sleeps.  I don't
want to force an extra msleep() call to be compiled into a bunch
of places where it's not needed.


---
Rob Elliott    HP Server Storage




RE: [PATCH RESEND v8] iopoll: Introduce memory-mapped IO polling macros

2014-11-24 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Mitchel Humpherys
> Sent: Monday, 24 November, 2014 2:15 PM
...
> From: Matt Wagantall 
> 
> It is sometimes necessary to poll a memory-mapped register until its value
> satisfies some condition. Introduce a family of convenience macros that do
> this. Tight-looping, sleeping, and timing out can all be accomplished
> using these macros.
> 
...
> +#define readx_poll_timeout(op, addr, val, cond, sleep_us, timeout_us)
> \
> +({ \
> + ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
> + might_sleep_if(sleep_us); \
> + for (;;) { \
> + (val) = op(addr); \
> + if (cond) \
> + break; \
> + if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) {
> \
> + (val) = op(addr); \
> + break; \
> + } \
> + if (sleep_us) \
> + usleep_range((sleep_us >> 2) + 1, sleep_us); \

The hpsa SCSI driver used to use usleep_range in a loop like 
that, but we found that it caused scheduler problems during
boots because it uses TASK_UNINTERRUPTIBLE:
[9.260668] [sched_delayed] sched: RT throttling activated

msleep() worked much better.

---
Rob Elliott    HP Server Storage





RE: [PATCH 06/10] libata: use __scsi_format_command()

2014-11-14 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Thursday, 06 November, 2014 2:31 AM
> To: James Bottomley
> Cc: Christoph Hellwig; Ewan Milne; Elliott, Robert (Server Storage);
> linux-s...@vger.kernel.org; Hannes Reinecke; linux-
> i...@vger.kernel.org; LKML
> Subject: [PATCH 06/10] libata: use __scsi_format_command()
> 
> libata already uses an internal buffer, so we should be using
> __scsi_format_command() here.
> 
> Cc: linux-...@vger.kernel.org
> Cc: LKML 
> Reviewed-by: Christoph Hellwig 
> Acked-by: Tejun Heo 
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/ata/libata-eh.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index dad83df..6550a9a 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -2509,15 +2509,12 @@ static void ata_eh_link_report(struct
> ata_link *link)
> 
>   if (ata_is_atapi(qc->tf.protocol)) {
>   if (qc->scsicmd)
> - scsi_print_command(qc->scsicmd);
> + __scsi_format_command(cdb_buf,
> sizeof(cdb_buf),
> +   qc->scsicmd->cmnd,
> +   qc->scsicmd->cmd_len);
>   else
> - snprintf(cdb_buf, sizeof(cdb_buf),
> -  "cdb %02x %02x %02x %02x %02x %02x %02x
> %02x  "
> -  "%02x %02x %02x %02x %02x %02x %02x %02x\n
> ",
> -  cdb[0], cdb[1], cdb[2], cdb[3],
> -  cdb[4], cdb[5], cdb[6], cdb[7],
> -  cdb[8], cdb[9], cdb[10], cdb[11],
> -  cdb[12], cdb[13], cdb[14], cdb[15]);
> + __scsi_format_command(cdb_buf,
> sizeof(cdb_buf),
> +   cdb, qc->dev->cdb_len);

Since this leaves just one use of the "cdb" variable, you could
replace "cdb" with "qc->cdb" and get rid of the variable declaration
(const u8 *cdb = qc->cdb;) to slightly simplify the code.


>   } else {
>   const char *descr = ata_get_cmd_descript(cmd-
> >command);
>   if (descr)
> --
> 1.8.5.2

Reviewed-by: Robert Elliott 



RE: BUG in scsi_lib.c due to a bad commit

2014-11-12 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Barto
> Sent: Wednesday, November 12, 2014 9:28 PM
> To: Guenter Roeck; Bjorn Helgaas
> Cc: linux-kernel@vger.kernel.org; linux-s...@vger.kernel.org; Joe
> Perches
> Subject: Re: BUG in scsi_lib.c due to a bad commit
> 
> reverting your commit 045065d8a300a37218c is a solution, but it's just a
> temporary solution,
> 
> it's better to search why your commit can create a random hang on boot
> on some PC configurations,
> 
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue
> *q)
> blk_requeue_request(q, req);
> atomic_dec(&sdev->device_busy);
> out_delay:
> - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> }
> 
> perhaps the atomic_read() function doesn't make the expected job on some
> rare circonstances, I have the same doubts about the blk_delay_queue()
> function

Were you running with scsi_mod.use_blk_mq=Y or =N?

device_busy is the active queue depth for the device (e.g.
5 means there are 5 commands submitted but not yet completed).

The function reaches this code if it has run out of tags, the host
has reached its limit of outstanding commands, or the target has
reached its limit.  It requeues the request:
* with a delay if device_busy is zero
* without a delay if device_busy is non-zero

I think this is the reasoning:
If device_busy is zero, trying to process the request again will
probably run into the same problem; a delay gives time for the
situation to change.  If device_busy is non-zero, then the 
requeued command goes behind others and might get a different
result.

With the polarity backwards, the lack of delay hung PA-RISC
and SPARC64 systems, not just QEMU.  So, I don't think reverting
the fix is a good idea.

Changing it to an unconditional delay might be safe - delay
regardless of device_busy (until the root cause is understood).

Also, SCSI_QUEUE_DELAY seems like an arbitrary magic number; 
maybe that value isn't working correctly anymore?




RE: absurdly high "optimal_io_size" on Seagate SAS disk

2014-11-07 Thread Elliott, Robert (Server Storage)
> commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
> Author: Martin K. Petersen 
> Date:   Thu Nov 6 12:31:43 2014 -0500
> 
> [SCSI] sd: Sanity check the optimal I/O size
> 
> We have come across a couple of devices that report crackpot
>   values in the optimal I/O size in the Block Limits VPD page. 
>   Since this is a 32-bit entity that gets multiplied by the
>   logical block size we can get
> disproportionately large values reported to the block layer.
> 
> Cap io_opt at 1 GB.

Another reasonable cap is the maximum transfer size.
There are lots of them:

* the block layer BIO_MAX_PAGES value of 256 limits IOs
  to a maximum of 1 MiB
* SCSI LLDs report their maximum transfer size in
  /sys/block/sdNN/queue/max_hw_sectors_kb
* the SCSI midlayer maximum transfer size is set/reported
  in /sys/block/sdNN/queue/max_sectors_kb
  and the default is 512 KiB
* the SCSI LLD maximum number of scatter gather entries
  reported in /sys/block/sdNN/queue/max_segments and
  /sys/block/sdNN/queue/max_segment_size creates a
  limit based on how fragmented the data buffer is
  in virtual memory
* the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
  indicates the maximum transfer size for one command over
  the SCSI transport protocol supported by the drive itself

It is risky to use transfer sizes larger than Linux and
Windows can generate, since drives are probably only tested in
those environments.

---
Rob Elliott    HP Server Storage





RE: [PATCH 1/4] pmem: Initial version of persistent memory driver

2014-11-04 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Boaz Harrosh [mailto:b...@plexistor.com]
> Sent: Tuesday, 04 November, 2014 4:38 AM
> To: Wilcox, Matthew R; Elliott, Robert (Server Storage); Ross
> Zwisler; Jens Axboe; Nick Piggin; Kani, Toshimitsu; Knippers, Linda;
> linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> nvd...@lists.01.org; Matthew Wilcox
> Subject: Re: [PATCH 1/4] pmem: Initial version of persistent memory
> driver
> 
> On 11/03/2014 06:19 PM, Wilcox, Matthew R wrote:
...
> 
> I wish you guys would actually review the correct code.
> 
> In the actual good driver that has any shape of proper code all these
> issue are gone.
> 
> * config defaults gone, multiple-devices multiple-memory ranges fully
>supported hot plug style.
> * above shifts cruft completely gone it is left overs from brd.c and
>   its page usage.
> * getgeo fixed to do what we realy want by the only application on earth
>   that still uses it, fdisk. All other partitioners do not call it at
>   all.
> 
> Why are we reviewing dead code ?
> 
> Cheers
> Boaz

Ross, what's the status of Boaz' patches (available in
git://git.open-osd.org/pmem.git)?

https://github.com/01org/prd.git doesn't include any of 
them yet.




RE: [PATCH v4 4/7] x86, mm, pat: Add pgprot_writethrough() for WT

2014-11-03 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Andy Lutomirski [mailto:l...@amacapital.net]
> Sent: Monday, November 03, 2014 5:01 PM
> To: Thomas Gleixner
> Cc: Kani, Toshimitsu; Elliott, Robert (Server Storage); h...@zytor.com;
> mi...@redhat.com; a...@linux-foundation.org; a...@arndb.de; linux-
> m...@kvack.org; linux-kernel@vger.kernel.org; jgr...@suse.com;
> stefan.ba...@canonical.com; h...@hmh.eng.br; yi...@plexistor.com;
> konrad.w...@oracle.com
> Subject: Re: [PATCH v4 4/7] x86, mm, pat: Add pgprot_writethrough() for
> WT
> 
> On Mon, Nov 3, 2014 at 2:53 PM, Thomas Gleixner 
> wrote:
...
> On the other hand, I thought that _GPL was supposed to be more about
> whether the thing using it is inherently a derived work of the Linux
> kernel.  Since WT is an Intel concept, not a Linux concept, then I
> think that this is a hard argument to make.

IBM System/360 Model 85 (1968) had write-through (i.e., store-through)
caching.  Intel might claim Write Combining, though.




RE: [PATCH v4 4/7] x86, mm, pat: Add pgprot_writethrough() for WT

2014-11-03 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Kani, Toshimitsu
> Sent: Monday, 27 October, 2014 5:56 PM
> To: h...@zytor.com; t...@linutronix.de; mi...@redhat.com; akpm@linux-
> foundation.org; a...@arndb.de
> Cc: linux...@kvack.org; linux-kernel@vger.kernel.org;
> jgr...@suse.com; stefan.ba...@canonical.com; l...@amacapital.net;
> h...@hmh.eng.br; yi...@plexistor.com; konrad.w...@oracle.com; Kani,
> Toshimitsu
> Subject: [PATCH v4 4/7] x86, mm, pat: Add pgprot_writethrough() for
> WT
> 
> This patch adds pgprot_writethrough() for setting WT to a given
> pgprot_t.
> 
> Signed-off-by: Toshi Kani 
> Reviewed-by: Konrad Rzeszutek Wilk 
...
> diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
> index a214f5a..a0264d3 100644
> --- a/arch/x86/mm/pat.c
> +++ b/arch/x86/mm/pat.c
> @@ -896,6 +896,16 @@ pgprot_t pgprot_writecombine(pgprot_t prot)
>  }
>  EXPORT_SYMBOL_GPL(pgprot_writecombine);
> 
> +pgprot_t pgprot_writethrough(pgprot_t prot)
> +{
> + if (pat_enabled)
> + return __pgprot(pgprot_val(prot) |
> + cachemode2protval(_PAGE_CACHE_MODE_WT));
> + else
> + return pgprot_noncached(prot);
> +}
> +EXPORT_SYMBOL_GPL(pgprot_writethrough);
...

Would you be willing to use EXPORT_SYMBOL for the new 
pgprot_writethrough function to provide more flexibility
for modules to utilize the new feature?  In x86/mm, 18 of 60
current exports are GPL and 42 are not GPL.

---
Rob Elliott    HP Server Storage




RE: [PATCH 2/4] pmem: Add support for getgeo()

2014-11-01 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Ross Zwisler
> Sent: Wednesday, 27 August, 2014 4:12 PM
> To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux-
> fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> nvd...@lists.01.org
> Cc: Ross Zwisler
> Subject: [PATCH 2/4] pmem: Add support for getgeo()
> 
> Some programs require HDIO_GETGEO work, which requires we implement
> getgeo.  Based off of the work done to the NVMe driver in this
> commit:
> 
> commit 4cc09e2dc4cb ("NVMe: Add getgeo to block ops")
> 
> Signed-off-by: Ross Zwisler 
> ---
>  drivers/block/pmem.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> index d366b9b..60bbe0d 100644
> --- a/drivers/block/pmem.c
> +++ b/drivers/block/pmem.c
> @@ -50,6 +50,15 @@ struct pmem_device {
>   size_t  size;
>  };
> 
> +static int pmem_getgeo(struct block_device *bd, struct hd_geometry
> *geo)
> +{
> + /* some standard values */
> + geo->heads = 1 << 6;
> + geo->sectors = 1 << 5;
> + geo->cylinders = get_capacity(bd->bd_disk) >> 11;

Just stuffing the result of get_capacity into the 16-bit 
cylinders field will overflow/wrap on large capacities.
0xffff << 11 = 0x7FF_F800 sectors = 64 GiB (68.7 GB)

How many programs still need these meaningless fields?
Could the bogus information be created elsewhere so
each block driver doesn't need to do this?


> + return 0;
> +}
> +
>  /*
>   * direct translation from (pmem,sector) => void*
>   * We do not require that sector be page aligned.
> @@ -176,6 +185,7 @@ out:
> 
>  static const struct block_device_operations pmem_fops = {
>   .owner =THIS_MODULE,
> + .getgeo =   pmem_getgeo,
>  };
> 
>  /* Kernel module stuff */
> --


---
Rob Elliott    HP Server Storage





RE: [PATCH 1/4] pmem: Initial version of persistent memory driver

2014-11-01 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Ross Zwisler
> Sent: Wednesday, 27 August, 2014 4:12 PM
> To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux-
> fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> nvd...@lists.01.org
> Cc: Ross Zwisler
> Subject: [PATCH 1/4] pmem: Initial version of persistent memory
> driver
> 
> PMEM is a new driver that presents a reserved range of memory as a
> block device.  This is useful for developing with NV-DIMMs, and
> can be used with volatile memory as a development platform.
> 
> Signed-off-by: Ross Zwisler 
> ---
>  MAINTAINERS|   6 +
>  drivers/block/Kconfig  |  41 ++
>  drivers/block/Makefile |   1 +
>  drivers/block/pmem.c   | 330
> +
>  4 files changed, 378 insertions(+)
>  create mode 100644 drivers/block/pmem.c
> 
...

> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 1b8094d..ac52f5a 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -404,6 +404,47 @@ config BLK_DEV_RAM_DAX
> and will prevent RAM block device backing store memory from
> being
> allocated from highmem (only a problem for highmem systems).
> 
> +config BLK_DEV_PMEM
> + tristate "Persistent memory block device support"
> + help
> +   Saying Y here will allow you to use a contiguous range of
> reserved
> +   memory as one or more block devices.  Memory for PMEM should
> be
> +   reserved using the "memmap" kernel parameter.
> +
> +   To compile this driver as a module, choose M here: the module
> will be
> +   called pmem.
> +
> +   Most normal users won't need this functionality, and can thus
> say N
> +   here.
> +
> +config BLK_DEV_PMEM_START
> + int "Offset in GiB of where to start claiming space"
> + default "0"
> + depends on BLK_DEV_PMEM
> + help
> +   Starting offset in GiB that PMEM should use when claiming
> memory.  This
> +   memory needs to be reserved from the OS at boot time using
> the
> +   "memmap" kernel parameter.
> +
> +   If you provide PMEM with volatile memory it will act as a
> volatile
> +   RAM disk and your data will not be persistent.
> +
> +config BLK_DEV_PMEM_COUNT
> + int "Default number of PMEM disks"
> + default "4"

For real use I think a default of 1 would be better.

> + depends on BLK_DEV_PMEM
> + help
> +   Number of equal sized block devices that PMEM should create.
> +
> +config BLK_DEV_PMEM_SIZE
> + int "Size in GiB of space to claim"
> + depends on BLK_DEV_PMEM
> + default "0"
> + help
> +   Amount of memory in GiB that PMEM should use when creating
> block
> +   devices.  This memory needs to be reserved from the OS at
> +   boot time using the "memmap" kernel parameter.
> +
>  config CDROM_PKTCDVD
>   tristate "Packet writing on CD/DVD media"
>   depends on !UML


...

> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> new file mode 100644
> index 000..d366b9b
> --- /dev/null
> +++ b/drivers/block/pmem.c
> @@ -0,0 +1,330 @@
> +/*
> + * Persistent Memory Driver
> + * Copyright (c) 2014, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or
> modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but
> WITHOUT
> + * ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> License for
> + * more details.
> + *
> + * This driver is heavily based on drivers/block/brd.c.
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define SECTOR_SHIFT 9
> +#define PAGE_SECTORS_SHIFT   (PAGE_SHIFT - SECTOR_SHIFT)
> +#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT)
> +
> +/*
> + * driver-wide physical address and total_size - one single,
> contiguous memory
> + * region that we divide up in to same-sized devices
> + */
> +phys_addr_t  phys_addr;
> +void *virt_addr;
> +size_t   total_size;
> +
> +struct pmem_device {
> + struct request_queue*pmem_queue;
> + struct gendisk  *pmem_disk;
> + struct list_headpmem_list;
> +
> + phys_addr_t phys_addr;
> + void*virt_addr;
> + size_t  size;
> +};
> +
> +/*
> + * direct translation from (pmem,sector) => void*
> + * We do not require that sector be page aligned.
> + * The return value will point to the beginning of the page
> containing the
> + * given sector, not to the secto

RE: [PATCH] MAINTAINERS: remove list iss_storage...@hp.com

2014-10-30 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Thursday, October 30, 2014 3:29 AM
> To: Sudip Mukherjee
> Cc: Andrew Morton; Greg Kroah-Hartman; linux-kernel@vger.kernel.org; Don
> Brace; Elliott, Robert (Server Storage)
> Subject: Re: [PATCH] MAINTAINERS: remove list iss_storage...@hp.com
> 
> On Thu, Oct 30, 2014 at 12:17:39PM +0530, Sudip Mukherjee wrote:
> > when sending mail to iss_storage...@hp.com it failed permanently
> >
> > "Google tried to deliver your message, but it was rejected by the
> > server for the recipient domain hp.com by smtp.hp.com.
> > [15.201.208.57].
> > The error that the other server returned was:
> > 550 5.1.1 : Recipient address
> > rejected: User unknown in virtual alias table"
> 
> I've got a MAINTAINERS update for these drivers queued up, but it only
> adds a new maintainer, and an additional list, but doesn't remove this
> one.  Don, Robert - should we remove the list as well, or should it come
> back?

That will be restored.




RE: [PATCH 1/1 linux-next] hpsa: remove set but unused variable rc

2014-10-29 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Sudip Mukherjee
> Sent: Thursday, October 30, 2014 12:55 AM
> To: Fabian Frederick
> Cc: linux-kernel@vger.kernel.org; Stephen M. Cameron; James E.J.
> Bottomley; iss_storage...@hp.com; linux-s...@vger.kernel.org
> Subject: Re: [PATCH 1/1 linux-next] hpsa: remove set but unused variable
> rc
> 
> On Wed, Oct 29, 2014 at 04:15:04PM +0100, Fabian Frederick wrote:
> > Fix -Wunused-but-set-variable warning
> 
> you should also mention why you have left the call to
> irq_set_affinity_hint().
> i am not sure , but it looks like irq_set_affinity_hint() is only
> checking if
> the lock is available or not. It is just locking ,then if lock is
> successfull then
> returning 0 or if lock fails then return -EINVAL, and unlocks before
> returnig.
> not doing anything else.
> 
> thanks
> sudip

No, that function sets a mask value that shows up in 
/proc/irq/nnn/affinity_hint
that a program like irqbalance may use to set the CPU affinity mask
for each irq via 
/proc/irq/nnn/smp_affinity   (bitmap format)
/proc/irq/nnn/smp_affinity_list   (range format)

The reason is that in many cases, it is best when all these occur 
on the same CPU that submitted the IO:
* LLD submission queues (if multiple are supported)
* LLD completion queues
* MSI-X interrupt indicating completion
* LLD completion interrupt handler
* block layer completion handler

Benefits include:
* cache efficiency - the data structures for the IO aren't
pulled from CPU to CPU.

* avoid IPI overhead in the block layer to get the completion 
processed on the submitting CPU (which is done if using 
rq_affinity=2 and the interrupt is routed to on another CPU).

* self-throttle the CPUs - avoid swamping one CPU with 
completion processing for IOs submitted by many other CPUs
(which leads to stalls on the victim and timeouts on the
aggressors).

You must run irqbalance with an option to honor the hints;
some versions default to that, others don't.  Or, disable
the irqbalance service and set them up the affinities
manually.




RE: [PATCH] aic7xxx: replace kmalloc/memset by kzalloc

2014-10-16 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Michael Opdenacker
> Sent: Thursday, 16 October, 2014 2:31 PM
...
> On 10/16/2014 09:28 PM, Joe Perches wrote:
> > On Thu, 2014-10-16 at 21:14 +0200, Michael Opdenacker wrote:
> >
> >
> > /* Allocate SCB resources */
> > -   scb_data->scbarray = kmalloc(sizeof(struct scb) * AHC_SCB_MAX_ALLOC,
> GFP_ATOMIC);
> > +   scb_data->scbarray = kzalloc(sizeof(struct scb) * AHC_SCB_MAX_ALLOC,
> > +   GFP_ATOMIC);
...
> >
> > Probably better as kcalloc.
> 
> Hey, well spotted! Thanks for your review. I will post a new version soon.

kcalloc is helpful when one of the values is a variable that 
might cause the multiply to overflow during runtime.  Here, 
two constants are being multiplied together, which can
be done and checked by the compiler at compile time.  

Since kcalloc and kmalloc_array are both static inline 
functions:
static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
{
if (size != 0 && n > SIZE_MAX / size)
return NULL;
return __kmalloc(n * size, flags);
}
static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
{
return kmalloc_array(n, size, flags | __GFP_ZERO);
}

a compiler that detects an overflow will probably just reduce
that to an inlined "return NULL."

BUILD_BUG_ON could be used to trigger a compile-time error,
instead of building a kernel that returns a run-time error.

---
Rob Elliott    HP Server Storage





RE: Locking issues with cpufreq and sysfs

2014-10-14 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Prarit Bhargava
> Sent: Tuesday, 14 October, 2014 1:24 PM
> To: Viresh Kumar
> Cc: Saravana Kannan; Rafael J. Wysocki; linux...@vger.kernel.org; Linux
> Kernel; Robert Schöne
> Subject: Re: Locking issues with cpufreq and sysfs
> 
> On 10/14/2014 03:10 AM, Viresh Kumar wrote:
> > On 13 October 2014 18:41, Prarit Bhargava  wrote:
> >>
> >> The locking is insufficient here, Viresh.  I no longer believe that fixes
> >> to this locking scheme are the right way to move forward here.  I'm
> wondering
> >> if we can look at other alternatives such as maintaining a refcount or
> >> perhaps using a queuing mechanism for governor and policy related changes.
> >>
...
> So I'm proposing that we move to a single threaded read/write using, if
> possible, a single policy lock for now.  We might transition this back to a
> rwsem later on, however, for the first attempt at cleaning this up I think we
> should just stick with a simple lock.  In doing that, IMO we remove
> cpufreq_rwsem: protects the driver from being unloaded
> cpufreq_governor_lock: protects the current governor
> each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct
> 
> and potentially
> cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver-
> >boost
> 
> After looking at the way the code would be structured, I'm wondering if
> cpufreq_governor_mutex: protects the cpufreq_governor_list
> is overkill.  The loading of a module should be atomic relative to the
> cpufreq code, so this lock may not be required.  (Admittedly I haven't
> tested that...)
> 
> That would leave:
> global_kobj_lock: protects the "cpufreq" kobject
> each policy has a transition_lock (policy->transition): synchronizes
> frequency transitions
> and a new lock, perhaps called policy->lock, to serialize all events.
> 

Please keep performance in mind too.  cpufreq_governor_lock
contention is a bit of an issue with heavy IO workloads
as described in:
http://marc.info/?l=linux-pm&m=140924051503827&w=2


---
Rob Elliott    HP Server Storage




RE: [PATCH 4/7 linux-next] cpqarray: remove deprecated IRQF_DISABLED

2014-10-06 Thread Elliott, Robert (Server Storage)

> -Original Message-
> From: Fabian Frederick [mailto:f...@skynet.be]
> Sent: Monday, October 06, 2014 10:36 AM
> To: linux-kernel@vger.kernel.org
> Cc: Greg Kroah-Hartman; Fabian Frederick; ISS StorageDev
> Subject: [PATCH 4/7 linux-next] cpqarray: remove deprecated
> IRQF_DISABLED
> 
> See include/linux/interrupt.h:
> "This flag is a NOOP and scheduled to be removed"
> 
> Signed-off-by: Fabian Frederick 
> ---
>  drivers/block/cpqarray.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/cpqarray.c b/drivers/block/cpqarray.c
> index 2b94403..370721e 100644
> --- a/drivers/block/cpqarray.c
> +++ b/drivers/block/cpqarray.c
> @@ -406,7 +406,7 @@ static int cpqarray_register_ctlr(int i, struct
> pci_dev *pdev)
>   }
>   hba[i]->access.set_intr_mask(hba[i], 0);
>   if (request_irq(hba[i]->intr, do_ida_intr,
> - IRQF_DISABLED|IRQF_SHARED, hba[i]->devname, hba[i]))
> + IRQF_SHARED, hba[i]->devname, hba[i]))
>   {
>   printk(KERN_ERR "cpqarray: Unable to get irq %d for %s\n",
>   hba[i]->intr, hba[i]->devname);
> --
> 1.9.3

Although we're awaiting resolution of MAINTAINERS updates for that
driver, this change was already made to cciss and hpsa several years
ago and seems fine here too.

Acked-by: Robert Elliott 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-10-03 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, 25 September, 2014 7:13 AM
> To: Douglas Gilbert
> Cc: SCSI development list; linux-kernel; James Bottomley; Christoph Hellwig;
> Milan Broz
> Subject: Re: [PATCH] scsi_debug: deadlock between completions and surprise
> module removal
> 
> Review ping again?
> 
> While I think the shutdown code in scsi_debug needs a bit more of an
> overhaul, I'd really like to include the fix at least for 3.18 and
> 3.17-stable now that we have missed the 3.17 window.
> 
> On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
> > A deadlock has been reported when the completion
> > of SCSI commands (simulated by a timer) was surprised
> > by a module removal. This patch removes one half of
> > the offending locks around timer deletions. This fix
> > is applied both to stop_all_queued() which is where
> > the deadlock was discovered and stop_queued_cmnd()
> > which has very similar logic.
> >
> > This patch should be applied both to the lk 3.17 tree
> > and Christoph's drivers-for-3.18 tree.
> >
> > Tested-and-reported-by: Milan Broz 
> > Signed-off-by: Douglas Gilbert 

Reviewed-by: Robert Elliott 





RE: [PATCH] block: disable entropy contributions from nonrot devices

2014-10-02 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Mike Snitzer
> Sent: Thursday, 02 October, 2014 7:11 PM
> To: ax...@kernel.dk; linux-kernel@vger.kernel.org
> Cc: ty...@mit.edu; gmazyl...@gmail.com; a...@redhat.com; mpato...@redhat.com
> Subject: [PATCH] block: disable entropy contributions from nonrot devices
> 
> Introduce queue_flags_set_nonrot_clear_add_random() and convert all
> block drivers that set QUEUE_FLAG_NONROT over to using it instead.
> 
> Historically, all block devices have automatically made entropy
> contributions.  But as previously stated in commit e2e1a148 ("block: add
> sysfs knob for turning off disk entropy contributions"):
> - On SSD disks, the completion times aren't as random as they
>   are for rotational drives. So it's questionable whether they
>   should contribute to the random pool in the first place.
> - Calling add_disk_randomness() has a lot of overhead.
> 
> There are more reliable sources for randomness than non-rotational block
> devices.  From a security perspective it is better to err on the side of
> caution than to allow entropy contributions from unreliable "random"
> sources.

blk-mq defaults to off (QUEUE_FLAG_MQ_DEFAULT does not 
include QUEUE_FLAG_ADD_RANDOM).

Even when it's off in block layer completion processing, all interrupts, 
storage or not, are forced to contribute during hardirq processing.
I've seen this add 3-12 us latency blips every 64 interrupts (when 
the "fast_mix" code runs out of bits).

Example of fast_mix only taking 0.071 us:
(from ftrace function_graph)
  8)               |  handle_irq_event() {
  8)               |    handle_irq_event_percpu() {
  8)               |      do_hpsa_intr_msi [hpsa]() {
  8)   0.060 us    |        SA5_performant_completed [hpsa]();
  8)   0.397 us    |      }
  8)   0.071 us    |      add_interrupt_randomness();
  8)   0.071 us    |      note_interrupt();
  8)   1.495 us    |    }
  8)   0.045 us    |    _raw_spin_lock();
  8)   2.165 us    |  }

Example of the 64th bit taking 3.814 us:
  8)               |  handle_irq_event() {
  8)               |    handle_irq_event_percpu() {
...
  8)               |      add_interrupt_randomness() {
  8)               |        __mix_pool_bytes() {
  8)   0.312 us    |          _mix_pool_bytes();
  8)   0.688 us    |        }
  8)               |        credit_entropy_bits() {
  8)               |          __wake_up() {
  8)   0.070 us    |            _raw_spin_lock_irqsave();
  8)   0.050 us    |            __wake_up_common();
  8)   0.056 us    |            _raw_spin_unlock_irqrestore();
  8)   1.448 us    |          }
  8)   0.048 us    |          kill_fasync();
  8)   2.313 us    |        }
  8)   3.814 us    |      }


---
Rob Elliott    HP Server Storage





RE: [PATCH] MAINTAINERS: remove iss_storage...@hp.com list

2014-10-02 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Michael Opdenacker
> Sent: Thursday, 02 October, 2014 2:46 AM
> To: a...@linux-foundation.org
> Cc: da...@davemloft.net; gre...@linuxfoundation.org; j...@perches.com;
> m.che...@samsung.com; cr...@iki.fi; jg1@samsung.com; linux-
> ker...@vger.kernel.org; scame...@beardog.cce.hp.com;
> mi...@beardog.cce.hp.com; Michael Opdenacker
> Subject: [PATCH] MAINTAINERS: remove iss_storage...@hp.com list
> 
> This removes the iss_storage...@hp.com mailing list from the MAINTAINERS
> file, as this address is now rejected by the HP mail server.
> 
> Signed-off-by: Michael Opdenacker 
> ---
>  MAINTAINERS | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 37054306dc9f..b0ed53797f43 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4215,14 +4215,12 @@ F:drivers/media/dvb-frontends/hd29l2*
> 
>  HEWLETT-PACKARD SMART2 RAID DRIVER
>  M:   Chirag Kantharia 
> -L:   iss_storage...@hp.com
>  S:   Maintained
>  F:   Documentation/blockdev/cpqarray.txt
>  F:   drivers/block/cpqarray.*
> 
>  HEWLETT-PACKARD SMART ARRAY RAID DRIVER (hpsa)
>  M:   "Stephen M. Cameron" 
> -L:   iss_storage...@hp.com
>  S:   Supported
>  F:   Documentation/scsi/hpsa.txt
>  F:   drivers/scsi/hpsa*.[ch]
> @@ -4231,7 +4229,6 @@ F:  include/uapi/linux/cciss*.h
> 
>  HEWLETT-PACKARD SMART CISS RAID DRIVER (cciss)
>  M:   Mike Miller 
> -L:   iss_storage...@hp.com
>  S:   Supported
>  F:   Documentation/blockdev/cciss.txt
>  F:   drivers/block/cciss*

NAK - that address is still valid; I just tested it from my gmail
account.  In fact, that's the only working email address of all
of those listed.  Chirag, Stephen, and Mike are no longer at HP.

I'll prod PMC-Sierra to send an update to MAINTAINERS.

---
Rob Elliott    HP Server Storage






RE: [PATCH] MAINTAINERS: fix Mike Miller's e-mail address

2014-10-02 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Michael Opdenacker [mailto:michael.opdenac...@free-electrons.com]
> Sent: Thursday, 02 October, 2014 2:17 AM
> To: a...@linux-foundation.org
> Cc: da...@davemloft.net; gre...@linuxfoundation.org; j...@perches.com;
> m.che...@samsung.com; cr...@iki.fi; jg1@samsung.com; linux-
> ker...@vger.kernel.org; mi...@beardog.cce.hp.com; ISS StorageDev
> Subject: Re: [PATCH] MAINTAINERS: fix Mike Miller's e-mail address
> 
> On 10/02/2014 09:04 AM, Michael Opdenacker wrote:
> > This replaces Mike Miller's e-mail address
> > (now rejected by the HP mail server) by the
> > latest one that he used to submit patches.
> >
> > Checked that this one is not rejected.
> >
> > Signed-off-by: Michael Opdenacker 
> > ---
> >  MAINTAINERS | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 37054306dc9f..6a8677b5ba24 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -4230,7 +4230,7 @@ F:include/linux/cciss*.h
> >  F: include/uapi/linux/cciss*.h
> >
> >  HEWLETT-PACKARD SMART CISS RAID DRIVER (cciss)
> > -M: Mike Miller 
> > +M: Mike Miller 
> >  L: iss_storage...@hp.com
> >  S: Supported
> >  F: Documentation/blockdev/cciss.txt
> 
> Oops, copying Mike Miller and the corresponding list too... Blind trust
> in get_maintainer.pl is evil ;)
> 
> Michael.
> 
> --
> Michael Opdenacker, CEO, Free Electrons
> Embedded Linux, Kernel and Android engineering
> http://free-electrons.com
> +33 484 258 098

NAK - Mike Miller is not at HP any more.  The beardog machine is still
here but the email accounts are not monitored.

PMC-Sierra (copying Scott Benesh) will be picking up maintenance of
hpsa and maybe cciss.  cpqarray should probably be marked unmaintained
or removed from the kernel.

---
Rob Elliott    HP Server Storage






RE: [PATCH] block: remove artifical max_hw_sectors cap

2014-10-01 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Wednesday, 01 October, 2014 8:08 AM
> To: Jens Axboe; linux-kernel@vger.kernel.org; linux-s...@vger.kernel.org; Wu
> Fengguang
> Subject: Re: [PATCH] block: remove artifical max_hw_sectors cap
> 
> As we still haven't made any progress on this let me explain why
> the limit does not make sense:  It only applies to _FS request,
> which basically have three use cases:
> 
>  - metadata I/O.  Generally small enough that the limit does not
>matter at all.
>  - buffered reads/writes.  We already have a self-tuning algorithm
>that limits writeback size, and a readahead size tunable that
>caps read sizes.  Imposing another confusing limit that does
>not interact with the visible tunables here is not helpful
>  - direct I/O.  Users should get something resembling their request
>as closely as possible on the write, and this is where our
>stupid limitation causes the most problems.

One supporting example: A low limit interferes with creation of
full stripe writes for RAID controllers.



> On Sat, Sep 06, 2014 at 04:08:05PM -0700, Christoph Hellwig wrote:
> > Set max_sectors to the value the drivers provides as hardware limit by
> > default.  Linux had proper I/O throttling for a long time and doesn't
> > rely on an artificially small maximum I/O size anymore.  By not limiting
> > the I/O size by default we remove an annoying tuning step required for
> > most Linux installations.
> >
> > Note that both the user, and if absolutely required the driver can still
> > impose a limit for FS requests below max_hw_sectors_kb.
> >
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  block/blk-settings.c   | 4 +---
> >  drivers/block/aoe/aoeblk.c | 2 +-
> >  include/linux/blkdev.h | 1 -
> >  3 files changed, 2 insertions(+), 5 deletions(-)
> >
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index f1a1795..f52c223 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -257,9 +257,7 @@ void blk_limits_max_hw_sectors(struct queue_limits
> *limits, unsigned int max_hw_
> >__func__, max_hw_sectors);
> > }
> >
> > -   limits->max_hw_sectors = max_hw_sectors;
> > -   limits->max_sectors = min_t(unsigned int, max_hw_sectors,
> > -   BLK_DEF_MAX_SECTORS);
> > +   limits->max_sectors = limits->max_hw_sectors = max_hw_sectors;
> >  }
> >  EXPORT_SYMBOL(blk_limits_max_hw_sectors);

1. Documentation/block/biodoc.txt needs some updates:

blk_queue_max_sectors(q, max_sectors)
Sets two variables that limit the size of the request.

- The request queue's max_sectors, which is a soft size in
units of 512 byte sectors, and could be dynamically varied
by the core kernel.

- The request queue's max_hw_sectors, which is a hard limit
and reflects the maximum size request a driver can handle
in units of 512 byte sectors.

The default for both max_sectors and max_hw_sectors is
255. The upper limit of max_sectors is 1024.

There is no function with that name (it is now called
blk_queue_max_hw_sectors); the upper limit of max_sectors
is max_hw_sectors; and the stated default is misleading (255
is only the default if the LLD doesn't provide max_hw_sectors).

2. Testing with hpsa and mpt3sas, this patch works as expected
for this setting.  I/O sizes are still limited by max_segments, 
which is expected.  Something else is still limiting I/O sizes
to 1 MiB, though; probably bio_get_nr_vecs enforcing a maximum
size per bio of BIO_MAX_PAGES 256 (which is 1 MiB with 4 KiB
pages).


Otherwise,
Reviewed-by: Robert Elliott 
Tested-by: Robert Elliott 

---
Rob Elliott    HP Server Storage





RE: ftrace function-graph and interprocessor interrupts

2014-09-25 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Steven Rostedt [mailto:rost...@goodmis.org]
...
> Does, this patch fix it for you?
> 
> -- Steve
...

> - if (type == TRACE_GRAPH_ENT)
> - ret = trace_seq_puts(s, "==>");
> - else
> + if (trace_flags & TRACE_ITER_LATENCY_FMT) {
> + ret = print_graph_lat_fmt(s, ent);
> + if (ret == TRACE_TYPE_PARTIAL_LINE)
> + return TRACE_TYPE_PARTIAL_LINE;
> + }
> +
> + if (*cpu_in_irq) {
>   ret = trace_seq_puts(s, "<==");
> + *cpu_in_irq = 0;
> + } else {
> + ret = trace_seq_puts(s, "==>");
> + *cpu_in_irq = in_irq;
> + }

That changes the direction of the arrows (which is fine, if
intended).

Results:
The beginning and end of do_IRQ are marked, but it drops off
during the middle.  The normal function trace shows all of those
still in hardirq context.


 10)   <== |
 10)               |  do_IRQ() {
 10)               |    irq_enter() {
 10)   0.082 us    |      rcu_irq_enter();
 10)   0.101 us    |      irqtime_account_irq();
 10)   1.321 us    |    } /* irq_enter */
 10)   ==> |
 10)   0.058 us    |    exit_idle();
 10)               |    handle_irq() {
 10)   0.091 us    |      irq_to_desc();
 10)               |      handle_edge_irq() {
 10)   0.068 us    |        _raw_spin_lock();
 10)   0.142 us    |        ir_ack_apic_edge();
 10)               |        handle_irq_event() {
 10)               |          handle_irq_event_percpu() {
...
 10)   0.092 us    |            add_interrupt_randomness();
 10)   0.080 us    |            note_interrupt();
 10)   9.510 us    |          } /* handle_irq_event_percpu */
 10)   0.071 us    |          _raw_spin_lock();
 10) + 10.577 us   |        } /* handle_irq_event */
 10) + 12.310 us   |      } /* handle_edge_irq */
 10) + 13.901 us   |    } /* handle_irq */
 10)               |    irq_exit() {
 10)   0.111 us    |      irqtime_account_irq();
 10)   <== |
 10)   0.074 us    |      idle_cpu();
 10)   0.080 us    |      rcu_irq_exit();
 10)   1.800 us    |    } /* irq_exit */
 10) + 19.132 us   |  } /* do_IRQ */
 10)   ==> |


Same with smp_apic_timer_interrupt:
 10)   <== |
 10)               |  smp_apic_timer_interrupt() {
 10)               |    irq_enter() {
 10)   0.072 us    |      rcu_irq_enter();
 10)   0.099 us    |      irqtime_account_irq();
 10)   1.069 us    |    } /* irq_enter */
 10)   ==> |
 10)   0.060 us    |    exit_idle();
 10)               |    local_apic_timer_interrupt() {
 10)               |      hrtimer_interrupt() {
...
 10)   1.810 us    |      } /* tick_program_event */
 10) + 32.361 us   |    } /* hrtimer_interrupt */
 10) + 32.904 us   |  } /* local_apic_timer_interrupt */
 10)               |  irq_exit() {
 10)   0.111 us    |    irqtime_account_irq();
 10)   <== |
 10)               |    __do_softirq() {
 10)   0.058 us    |      msecs_to_jiffies();
 10)   0.099 us    |      irqtime_account_irq();
 10)               |      smp_call_function_single_interrupt() {
 10)               |        irq_enter() {
 10)   0.069 us    |          rcu_irq_enter();
 10)   0.100 us    |          irqtime_account_irq();
 10)   1.251 us    |        } /* irq_enter */
 10)   ==> |


generic_smp_call_function_single_interrupt doesn't
seem to be getting marked, even though function trace 
finds many that are in hardirq context.

 fio-7146  [010] d...  2968.183376: smp_call_function_single_interrupt <-call_function_single_interrupt
 fio-7146  [010] d...  2968.183376: irq_enter <-smp_call_function_single_interrupt

RE: [PATCH RESEND v3 5/6] AHCI: Optimize single IRQ interrupt processing

2014-09-24 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Tejun Heo
...
> The thing I don't get is why multiple MSI handling and this patchset
> are tied to threaded interrupt handling.  Splitting locks don't
> necessarily have much to do with threaded handling and it's not like
> ahci interrupt handling is heavy.  The hot path is pretty short
> actually.  The meat of the work - completing requests and propagating
> completions - is offloaded to softirq by block layer anyway.

blk-mq/scsi-mq chose to move all completion work into hardirq context,
so this seems headed in a different direction.




RE: ftrace function-graph and interprocessor interrupts

2014-09-24 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Andi Kleen [mailto:a...@firstfloor.org]
> Sent: Wednesday, 24 September, 2014 4:50 PM
> To: Elliott, Robert (Server Storage)
> Cc: Steven Rostedt; linux-kernel@vger.kernel.org; Jens Axboe
>  (ax...@kernel.dk); Christoph Hellwig
> Subject: Re: ftrace function-graph and interprocessor interrupts
> 
> "Elliott, Robert (Server Storage)"  writes:
> 
> > The function-graph tracer marks some interrupt handler functions
> > with ==>  and <== labels.
> 
> I'm not sure the marking is really that useful. Isn't it always obvious
> from the function names where an interrupt starts/end?
> 
> -Andi

Although the do_IRQ name stands out pretty well, some of the
others don't, and blk-mq calling them directly makes it hard
to tell.  They show up clearly in the function trace, just
not the function_graph trace.

Also, the IPI function can end up nested inside { } but 
without indents, depending on when it occurs.

 10)               |  sd_setup_read_write_cmnd [sd_mod]() {
 10)               |    smp_call_function_single_interrupt() {
 10)               |      irq_enter() {
...
 10) + 36.788 us   |    } /* smp_call_function_single_interrupt */
 10)               |    scsi_init_io() {

The ==> labels also add an indent level.

I'd like to add an option to exclude the time taken by interrupts
in the cumulative times, but that first requires that function_graph
understand what times to exclude.



ftrace function-graph and interprocessor interrupts

2014-09-24 Thread Elliott, Robert (Server Storage)
The function-graph tracer marks some interrupt handler functions
with ==>  and <== labels.

 10)               |  sd_setup_read_write_cmnd [sd_mod]() {
 10)               |    scsi_init_io() {
 10)               |      scsi_init_sgtable() {
 10)               |        scsi_alloc_sgtable() {
 10)   ==> |
 10)               |          smp_apic_timer_interrupt() {
 10)               |            irq_enter() {
 10)   0.093 us    |              rcu_irq_enter();
 10)   0.102 us    |              irqtime_account_irq();
 10)   1.213 us    |            } /* irq_enter */
...
 10)   0.054 us    |              idle_cpu();
 10)   0.077 us    |              rcu_irq_exit();
 10)   6.953 us    |            } /* irq_exit */
 10) + 45.238 us   |          } /* smp_apic_timer_interrupt */
 10)   <== |
 10) + 46.256 us   |        } /* scsi_alloc_sgtable */
 10)               |        blk_rq_map_sg() {
 10)   0.101 us    |          __blk_bios_map_sg();


Interprocessor interrupts are not marked, though (on x86_64). In 
this example, smp_call_function_single_interrupt is really in hardirq
context.  The trace just shows it at the same level as the function
it interrupted.


 10)   0.068 us    |  scsi_prep_state_check();
 10)   0.095 us    |  get_device();
 10)               |  scsi_mq_prep_fn() {
 10)   0.063 us    |    init_timer_key();
 10)               |    scsi_setup_cmnd() {
 10)               |      sd_init_command [sd_mod]() {
 10)               |        sd_setup_read_write_cmnd [sd_mod]() {
 10)               |          smp_call_function_single_interrupt() {
 10)               |            irq_enter() {
 10)   0.096 us    |              rcu_irq_enter();
 10)   0.116 us    |              irqtime_account_irq();
 10)   1.243 us    |            } /* irq_enter */
 10)               |            generic_smp_call_function_single_interrupt() {
 10)               |              flush_smp_call_function_queue() {
 10)               |                __blk_mq_complete_request_remote() {
 10)               |                  scsi_softirq_done() {
 10)               |                    scsi_decide_disposition() {
...
 10) + 36.788 us   |          } /* smp_call_function_single_interrupt */
 10)               |          scsi_init_io() {
 10)               |            scsi_init_sgtable() {
 10)   0.098 us    |              scsi_alloc_sgtable();
 10)               |              blk_rq_map_sg() {
 10)   0.148 us    |                __blk_bios_map_sg();
 10)   0.722 us    |              } /* blk_rq_map_sg */
 10)   1.862 us    |            } /* scsi_init_sgtable */
 10)   2.424 us    |          } /* scsi_init_io */
 10)   3.080 us    |        } /* sd_setup_read_write_cmnd [sd_mod] */
 10) + 41.431 us   |      } /* sd_init_command [sd_mod] */
 10) + 42.012 us   |    } /* scsi_setup_cmnd */
 10) + 43.179 us   |  } /* scsi_mq_prep_fn */

The plain function tracer shows that smp_call_function_single_interrupt
is really in hardirq context, marking it with "h":
 fio-4607  [010]   5908.340807: blkdev_get_block <-do_direct_IO
 fio-4607  [010]   5908.340807: put_page <-do_direct_IO
 fio-4607  [010] d.h.  5908.340808: generic_smp_call_function_single_interrupt <-smp_call_function_single_interrupt
 fio-4607  [010] d.h.  5908.340808: flush_smp_call_function_queue <-generic_smp_call_function_single_interrupt
 fio-4607  [010] d.h.  5908.340808: __blk_mq_complete_request_remote <-flush_smp_call_function_queue
 fio-4607  [010] d.h.  5908.340808: scsi_softirq_done <-__blk_mq_c

RE: boot stall regression due to blk-mq: use percpu_ref for mq usage count

2014-09-23 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Tejun Heo
> Sent: Tuesday, 23 September, 2014 1:12 AM
> To: Christoph Hellwig
> Cc: Jens Axboe; linux-kernel@vger.kernel.org; linux-s...@vger.kernel.org
> Subject: Re: boot stall regression due to blk-mq: use percpu_ref for mq usage
> count
> 
> On Tue, Sep 23, 2014 at 08:09:06AM +0200, Christoph Hellwig wrote:
> > On Tue, Sep 23, 2014 at 02:01:41AM -0400, Tejun Heo wrote:
> > > On Tue, Sep 23, 2014 at 07:59:24AM +0200, Christoph Hellwig wrote:
> > > > "[PATCHSET percpu/for-3.18] percpu_ref: implement
> > > > switch_to_atomic/percpu()"
> > > >
> > > > looks way too big for 3.17, and the regression was introduced in the
> > > > 3.17 merge window.  I'm not sure what was broken before, but it
> > > > definitely survived a lot of testing.
> > >
> > > Do we even care about fixing it for 3.17?  scsi-mq isn't enabled by
> > > default even for 3.18.  The open-coded percpu ref thing was subtly
> > > broken there.  It'd be difficult to trigger but I'm fairly sure it'd
> > > crap out in the wild once in a blue moon.
> >
> > It's compiled in by default, and people are extremely eager to test it.
> 
> Ugh, I don't know.  It's not like we have a very good baseline we can
> go back to and reverting it for -stable and then redoing it seems
> kinda excessive for a yet experimental feature.  Jens?

scsi_mod.use_blk_mq is not listed in 3.17rc6 Documentation/kernel-
parameters.txt or Documentation/scsi/scsi-parameters.txt (which
does admit "This document may not be entirely up to date and
comprehensive.").

Perhaps a description with an "experimental" warning could be 
slipped into scsi-parameters.txt in 3.17, with plans to remove
that warning in 3.18.

---
Rob Elliott    HP Server Storage





RE: [patch 22/26] x86: Add proper vector accounting for HyperV

2014-09-22 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Thomas Gleixner
> Sent: Sunday, 23 February, 2014 3:40 PM
> To: LKML
> Cc: Ingo Molnar; Peter Zijlstra; x86; Konrad Rzeszutek Wilk; K. Y. Srinivasan
> Subject: [patch 22/26] x86: Add proper vector accounting for HyperV
> 
> HyperV abuses a device interrupt to account for the
> HYPERVISOR_CALLBACK_VECTOR.
> 
> Provide proper accounting as we have for the other vectors as well.
...
> Index: tip/arch/x86/kernel/irq.c
> ===
> --- tip.orig/arch/x86/kernel/irq.c
> +++ tip/arch/x86/kernel/irq.c
> @@ -125,6 +125,12 @@ int arch_show_interrupts(struct seq_file
>   seq_printf(p, "%10u ", per_cpu(mce_poll_count, j));
>   seq_printf(p, "  Machine check polls\n");
>  #endif
> +#if defined(CONFIG_HYPERV) || defined(CONFIG_XEN)
> + seq_printf(p, "%*s: ", prec, "THR");
> + for_each_online_cpu(j)
> + seq_printf(p, "%10u ", irq_stats(j)->irq_hv_callback_count);
> + seq_printf(p, "  Hypervisor callback interrupts\n");
> +#endif
...

Should this have used a different short string than "THR",
which is already used for "Threshold APIC interrupts"?

---
Rob Elliott    HP Server Storage





RE: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)

2014-09-22 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Friday, 19 September, 2014 6:22 AM
> To: Elliott, Robert (Server Storage)
> Cc: Andreas Dilger; Milosz Tanski; linux-kernel@vger.kernel.org; Christoph
> Hellwig; linux-fsde...@vger.kernel.org; linux-...@kvack.org; Mel Gorman;
> Volker Lendecke; Tejun Heo; Jeff Moyer
> Subject: Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
> 
> On Mon, Sep 15, 2014 at 10:36:46PM +, Elliott, Robert (Server Storage)
> wrote:
> > That sounds like the proposed WRITE SCATTERED/READ GATHERED
> > commands for SCSI (which are related to, but not necessarily
> > tied to, atomic writes).  We discussed them a bit at
> > LSF-MM 2013 - see http://lwn.net/Articles/548116/.
> 
> In the same way a preadx/pwritex could use but would not require an
> O_ATOMIC.  What's the status of those in t10?  Last I heard
> READ GATHERED was out and they were only looking into WRITE SCATTERED?

Both of these essentially require more CDB bytes to convey the
LBA range list.  Under the current SCSI architecture model, the 
choices are:
* include in a longer CDB
* include in the data-out buffer

For longer CDBs:
* CDBs >16 bytes are not widely supported
* 260 byte max CDB size limits the number of LBA ranges
* in most SCSI protocols, commands are unsolicited (push rather
than pull), so the target must have buffer space for (max queue
depth)*(max CDB size). In SCSI Express, although CDBs are pulled
with PCIe memory reads rather than pushed, longer CDBs complicate
circular queue handling.

For the data-out buffer:
* not delivering all the CDB info upfront complicates drive 
hardware designs. They want to get the data transfer started
from the medium, but have to wait for a whole extra DMA 
transfer first. This is not so bad for low-latency PCIe,
but is not a good fit for protocols behind HBAs like
SAS, iSCSI, etc.
* READ GATHERED requires bidirectional command support, which 
is not widely or efficiently supported

Protocols could add direct support for delivering more CDB bytes
(like how the ATA PACKET command delivers a SCSI CDB over
an ATA transport), but that requires a lot of changes.

---
Rob Elliott    HP Server Storage




RE: blk-mq timeout handling fixes

2014-09-17 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Saturday, 13 September, 2014 6:40 PM
> To: Jens Axboe
> Cc: Elliott, Robert (Server Storage); linux-s...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: blk-mq timeout handling fixes
> 
> This series fixes various issues with timeout handling that Robert
> ran into when testing scsi-mq heavily.  He tested an earlier version,
> and couldn't reproduce the issues anymore, although the series changed
> quite significantly since and should probably be retested.
> 
> In summary we not only start the blk-mq timer inside the drivers
> ->queue_rq method after the request has been fully setup, and we
> also tell the drivers if we're timing out a reserved (internal)
> request or a real one.  Many drivers including will need to handle
> those internal ones differently, e.g. for scsi-mq we don't even
> have a scsi command structure allocated for the reserved commands.

I have rerun a variety of tests on:
* Jens' for-next tree that went into 3.17rc5
* plus this series
* plus two patches for infinite recursion on flushes from 
  Ming and then Christoph

and have not been able to trigger the scsi_times_out req->special
NULL pointer dereference that prompted this series.

Testing includes:
* concurrent heavy workload generators:
  * fio high iodepth direct 512 byte random reads (> 1M IOPS)
  * programs generating large bursts of paged writes
* mkfs.ext4 (followed by e2fsck)
* mkfs.xfs (followed by xfs_check)
* ddpt
  * watch -n 0 sync to generate flushes
* scsi_logging_level MLCOMPLETE set to 0 or 1
  * scsi_lib.c patched to put all the ACTION_FAIL messages
under level 1 so they can be squelched (massive error 
prints cause more timeouts themselves)
* 4 hpsa and 16 mpt3sas devices (all made from SAS SSDs)
  * lockless hpsa driver
* injecting errors
  * device removal
  * device generating infinite errors
  * device generating a brief number of errors

The filesystems don't always recover properly, but nothing in 
the block or scsi midlayers crashed.

So, you may add this to the series:
Tested-by: Robert Elliott 

---
Rob Elliott    HP Server Storage






RE: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)

2014-09-15 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Andreas Dilger
> Sent: Monday, 15 September, 2014 4:34 PM
> To: Milosz Tanski
> Cc: linux-kernel@vger.kernel.org; Christoph Hellwig; linux-
> fsde...@vger.kernel.org; linux-...@kvack.org; Mel Gorman; Volker Lendecke;
> Tejun Heo; Jeff Moyer
> Subject: Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
> 
> On Sep 15, 2014, at 2:20 PM, Milosz Tanski  wrote:
> 
> > This patchset introduces the ability to perform a non-blocking read
> > from regular files in buffered IO mode.  This works only for those
> > filesystems that have data in the page cache.
> >
> > It does this by introducing new syscalls readv2/writev2
> > and preadv2/pwritev2. These new syscalls behave like the network sendmsg,
> > recvmsg syscalls that accept an extra flag argument (O_NONBLOCK).
> 
> It's too bad that we are introducing yet another new read/write
> syscall pair that only allow IO into discontiguous memory regions,
> but do not allow a single call to access discontiguous file regions
> (i.e. specify a separate file offset for each iov).
> 
> Adding syscalls similar to preadv/pwritev() that could take a iovec
> that specified the file offset+length in addition to the memory address
> would allow efficient scatter-gather IO in a single syscall.  While
> that is less critical for local filesystems with small syscall latency,
> it is more important for network filesystems, or in the case of
> NVRAM-backed filesystems.
> 
> Cheers, Andreas

That sounds like the proposed WRITE SCATTERED/READ GATHERED 
commands for SCSI (which are related to, but not necessarily
tied to, atomic writes).  We discussed them a bit at 
LSF-MM 2013 - see http://lwn.net/Articles/548116/.


---
Rob Elliott    HP Server Storage







RE: [PATCH] scsi: scsi_devinfo.c: Cleaning up unnecessarily complicated in conjunction with strncpy

2014-09-14 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Rickard Strandqvist [mailto:rickard_strandqv...@spectrumdigital.se]
> How do you mean?
> 
> strncpy zeroes throughout the remainder of the string "from" until the
> length off to_length, or otherwise guaranteed trailing zero characters
> and a warning is printed.
> 
> Is not it exactly the functionality that is desired?

Ah, I see that in man 3 strcpy: 
"If the length of src is less than n, strncpy() pads the 
 remainder of dest with null bytes."

I agree that should work.

---
Rob Elliott    HP Server Storage





RE: [PATCH] scsi: scsi_devinfo.c: Cleaning up unnecessarily complicated in conjunction with strncpy

2014-09-14 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Rickard Strandqvist
...
> diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
...
>  static void scsi_strcpy_devinfo(char *name, char *to, size_t to_length,
>   char *from, int compatible)
>  {
> - size_t from_length;
> -
> - from_length = strlen(from);
> - strncpy(to, from, min(to_length, from_length));
> - if (from_length < to_length) {
> - if (compatible) {
> - /*
> -  * NUL terminate the string if it is short.
> -  */
> - to[from_length] = '\0';
> - } else {
> - /*
> -  * space pad the string if it is short.
> -  */
> - strncpy(&to[from_length], spaces,
> - to_length - from_length);
> - }
> - }
> - if (from_length > to_length)
> -  printk(KERN_WARNING "%s: %s string '%s' is too long\n",
> + strncpy(to, from, to_length);
> + if (to[to_length - 1] != '\0') {
> + to[to_length - 1] = '\0';
> + printk(KERN_WARNING "%s: %s string '%s' is too long\n",
>   __func__, name, from);
> + }

The caller of this function, scsi_dev_info_list_add_keyed, created
the "to" destination buffer, devinfo, with kmalloc, so it's not
guaranteed to be full of zeros.

If from_length is shorter than to_length, then this code will
be inspecting an uninitialized character that strncpy didn't
touch.

---
Rob Elliott    HP Server Storage







RE: [PATCH 1/2] block: default to rq_affinity=2 for blk-mq

2014-09-10 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Wednesday, 10 September, 2014 1:15 PM
> To: Robert Elliott; Elliott, Robert (Server Storage); h...@lst.de;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 1/2] block: default to rq_affinity=2 for blk-mq
> 
> On 09/09/2014 06:18 PM, Robert Elliott wrote:
> > From: Robert Elliott 
> >
> > One change introduced by blk-mq is that it does all
> > the completion work in hard irq context rather than
> > soft irq context.
> >
> > On a 6 core system, if all interrupts are routed to
> > one CPU, then you can easily run into this:
> > * 5 CPUs submitting IOs
> > * 1 CPU spending 100% of its time in hard irq context
> > processing IO completions, not able to submit anything
> > itself
> >
> > Example with CPU5 receiving all interrupts:
> >CPU usage:   CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
> > %usr:   0.00   3.03   1.01   2.02   2.00   0.00
> > %sys:  14.58  75.76  14.14   4.04  78.00   0.00
> > %irq:   0.00   0.00   0.00   1.01   0.00 100.00
> >%soft:   0.00   0.00   0.00   0.00   0.00   0.00
> > %iowait idle:  85.42  21.21  84.85  92.93  20.00   0.00
> >%idle:   0.00   0.00   0.00   0.00   0.00   0.00
> >
> > When the submitting CPUs are forced to process their own
> > completion interrupts, this steals time from new
> > submissions and self-throttles them.
> >
> > Without that, there is no direct feedback to the
> > submitters to slow down.  The only feedback is:
> > * reaching max queue depth
> > * lots of timeouts, resulting in aborts, resets, soft
> >   lockups and self-detected stalls on CPU5, bogus
> >   clocksource tsc unstable reports, network
> >   drop-offs, etc.
> >
> > The SCSI LLD can set affinity_hint for each of its
> > interrupts to request that a program like irqbalance
> > route the interrupts back to the submitting CPU.
> > The latest version of irqbalance ignores those hints,
> > though, instead offering an option to run a policy
> > script that could honor them. Otherwise, it balances
> > them based on its own algorithms. So, we cannot rely
> > on this.
> >
> > Hardware might perform interrupt coalescing to help,
> > but it cannot help 1 CPU keep up with the work
> > generated by many other CPUs.
> >
> > rq_affinity=2 helps by pushing most of the block layer
> > and SCSI midlayer completion work back to the submitting
> > CPU (via an IPI).
> >
> > Change the default rq_affinity=2 under blk-mq
> > so there's at least some feedback to slow down the
> > submitters.
> 
> I don't think we should do this generically. For "sane" devices with
> multiple completion queues, and with proper affinity setting in the
> driver, this is going to be a loss.
> 
> So lets not add it to QUEUE_FLAG_MQ_DEFAULT, but we can make it
> default
> for nr_hw_queues == 1. I think that would be way saner.
> 
> --
> Jens Axboe

If the interrupt does arrive on the submitting CPU, then it 
meets the criteria for all the cases:
* 1: complete on any CPU
* 2: complete on submitting CPU's node (QUEUE_FLAG_SAME_COMP)
* 3: complete on submitting CPU (QUEUE_FLAG_SAME_FORCE)

and _blk_complete_request handles it locally rather
than sending an IPI.

	if (req->cpu != -1) {
		ccpu = req->cpu;
		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
			shared = cpus_share_cache(cpu, ccpu);
	} else
		ccpu = cpu;
...
	if (ccpu == cpu || shared) {
		struct list_head *list;
do_local:
...
	} else if (raise_blk_irq(ccpu, req))
		goto do_local;


Are you saying you want the blk_queue_bio submission to 
not even set the req->cpu field (which defaulted to -1):
	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
		req->cpu = raw_smp_processor_id();

when you expect the interrupt routing is good so that
_blk_complete_request can avoid the test_bit and 
cpus_share_cache calls?

With irqbalance no longer honoring affinity_hint
by default, I'm worried that most LLDs will not find
their interrupts routed that way anymore.  That's
how we ran into this; scsi-mq + kernel-3.17 on an
up-to-date RHEL 6.5 distro (which now carries the
new irqbalance).

We plan to create a policy script for the new irqbalance
for hpsa devices, but other high-IOPS drivers will hit
the same problem.

---
Rob Elliott    HP Server Storage





RE: [PATCH 0/6] blk-mq: initialize pdu of flush req explicitly

2014-09-08 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Ming Lei
> Sent: Monday, 08 September, 2014 11:55 AM
> To: Christoph Hellwig
> Cc: Jens Axboe; Linux Kernel Mailing List; Linux SCSI List
> Subject: Re: [PATCH 0/6] blk-mq: initialize pdu of flush req
> explicitly
> 
> On Mon, Sep 8, 2014 at 2:49 AM, Christoph Hellwig  wrote:
> > This works fine for me, although I still don't really like it very
> much.
> >
> > If you really want to go down the path of major surgery in this
> > area we should probably allocate a flush request per hw_ctx, and
> > initialize it using the normal init/exit functions.  If we want
> > to have proper multiqueue performance on devices needing flushes
> > we'll need something like that anyway.
> 
> Yes, that should be the final solution for the problem, and looks the
> whole flush machinery need to move into hctx, I will try to figure
> out one patch to do that.

Please change flush_rq allocation from kzalloc to kzalloc_node 
while operating on that code (would have affected PATCH 1/6).

blk_mq_init_queue currently has this for q->flush_rq:
	q->flush_rq = kzalloc(round_up(sizeof(struct request) +
				set->cmd_size, cache_line_size()),
				GFP_KERNEL);

while all its other allocations are tied to set->numa_node:
hctxs = kmalloc_node(set->nr_hw_queues * sizeof(*hctxs), GFP_KERNEL,
set->numa_node);
q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node);

or, for per-CPU structures, tied to the appropriate node:
for (i = 0; i < set->nr_hw_queues; i++) {
int node = blk_mq_hw_queue_to_node(map, i);

hctxs[i] = kzalloc_node(sizeof(struct blk_mq_hw_ctx),
GFP_KERNEL, node);

Per-hctx flush requests would mean following the hctxs[i]
approach.


---
Rob Elliott    HP Server Storage





RE: [PATCH v2 4/6] block: loop: say goodby to bio

2014-08-30 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Ming Lei
> Sent: Saturday, 30 August, 2014 11:08 AM
> To: Jens Axboe; linux-kernel@vger.kernel.org; Dave Kleikamp
> Cc: Zach Brown; Christoph Hellwig; Maxim Patlasov; Ming Lei
> Subject: [PATCH v2 4/6] block: loop: say goodby to bio
...
> -static int do_bio_filebacked(struct loop_device *lo, struct bio
> *bio)
> +static int do_req_filebacked(struct loop_device *lo, struct request
> *rq)
>  {
>   loff_t pos;
>   int ret;
> 
> - pos = ((loff_t) bio->bi_iter.bi_sector << 9) + lo->lo_offset;
> + pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
> 
> - if (bio_rw(bio) == WRITE) {
> + if (rq->cmd_flags & REQ_WRITE) {
>   struct file *file = lo->lo_backing_file;
> 
> - if (bio->bi_rw & REQ_FLUSH) {
> + if (rq->cmd_flags & REQ_FLUSH) {
>   ret = vfs_fsync(file, 0);
>   if (unlikely(ret && ret != -EINVAL)) {
>   ret = -EIO;
> @@ -436,7 +439,7 @@ static int do_bio_filebacked(struct loop_device
> *lo, struct bio *bio)
>* encryption is enabled, because it may give an attacker
>* useful information.
>*/
> - if (bio->bi_rw & REQ_DISCARD) {
> + if (rq->cmd_flags & REQ_DISCARD) {
>   struct file *file = lo->lo_backing_file;
>   int mode = FALLOC_FL_PUNCH_HOLE |
> FALLOC_FL_KEEP_SIZE;

Can a REQ_WRITE_SAME make it through to here?


---
Rob Elliott    HP Server Storage





RE: [PATCH v2 3/6] block: loop: convert to blk-mq

2014-08-30 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Ming Lei
> Sent: Saturday, 30 August, 2014 11:08 AM
> To: Jens Axboe; linux-kernel@vger.kernel.org; Dave Kleikamp
> Cc: Zach Brown; Christoph Hellwig; Maxim Patlasov; Ming Lei
> Subject: [PATCH v2 3/6] block: loop: convert to blk-mq
> 
...
> -static inline void loop_handle_bio(struct loop_device *lo, struct
> bio *bio)
> +static inline int loop_handle_bio(struct loop_device *lo, struct bio
> *bio)
>  {
> - if (unlikely(!bio->bi_bdev)) {
> - do_loop_switch(lo, bio->bi_private);
> - bio_put(bio);
> - } else {
> - int ret = do_bio_filebacked(lo, bio);
> - bio_endio(bio, ret);
> - }
> + int ret = do_bio_filebacked(lo, bio);
> + return ret;

No need for the temporary variable; just return the function
call.

...
> +static int loop_prepare_hctxs(struct loop_device *lo)
> +{
> + struct request_queue *q = lo->lo_queue;
> + struct blk_mq_hw_ctx *hctx;
> + struct loop_hctx_data *data;
> + unsigned int i;
> +
> + queue_for_each_hw_ctx(q, hctx, i) {
> + BUG_ON(i >= lo->tag_set.nr_hw_queues);

If something goes bad in the loop driver like that, is it
necessary to crash the whole kernel?  

> + data = hctx->driver_data;
> +
> + data->lo = lo;
> + init_kthread_worker(&data->worker);
> + data->worker_task = kthread_run(kthread_worker_fn,
> + &data->worker, "loop%d-%d",
> + lo->lo_number, i);
> + if (IS_ERR(data->worker_task)) {
> + loop_unprepare_hctxs(lo, i);
> + return -ENOMEM;
> + }
> + set_user_nice(data->worker_task, MIN_NICE);
> + sched_getaffinity(data->worker_task->pid, hctx->cpumask);

Is that supposed to be sched_setaffinity?  It looks like
getaffinity does have a side-effect of updating the CPU mask
based on the current active cpus, but there won't be a CPU mask
to update unless setaffinity was called.

...
> @@ -906,14 +848,10 @@ static int loop_set_fd(struct loop_device *lo,
> fmode_t mode,
> 
>   set_blocksize(bdev, lo_blocksize);
> 
> - lo->lo_thread = kthread_create(loop_thread, lo, "loop%d",
> - lo->lo_number);
> - if (IS_ERR(lo->lo_thread)) {
> - error = PTR_ERR(lo->lo_thread);
> + if ((error = loop_prepare_hctxs(lo)) != 0)
>   goto out_clr;

I've been told that Linux kernel style is:
error = x();
if (error)

...
> @@ -1014,7 +951,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   lo->lo_state = Lo_rundown;
>   spin_unlock_irq(&lo->lo_lock);
> 
> - kthread_stop(lo->lo_thread);
> + loop_unprepare_hctxs(lo, lo->tag_set.nr_hw_queues);
> 
>   spin_lock_irq(&lo->lo_lock);
>   lo->lo_backing_file = NULL;
...
> +
> +static int loop_prepare_flush_rq(void *data, struct request_queue
> *q,
> + struct request *flush_rq,
> + const struct request *src_rq)
> +{
> + /* borrow initialization helper for common rq */
> + loop_init_request(data, flush_rq, 0, -1, NUMA_NO_NODE);
> + return 0;
> +}

In patch 2/6 that function is called with:
if (orig_rq->q->mq_ops->prepare_flush_rq)
ret = orig_rq->q->mq_ops->prepare_flush_rq(
orig_rq->q->tag_set->driver_data,
orig_rq->q, flush_rq, orig_rq);


The src_rq argument is not used in this new function.
Do you anticipate it might be necessary in other drivers?

Is this new function allowed to modify *data, *flush_rq,
or *src_rq?  If not, const might be good to add.

If orig_rq is always passed, then this function could 
decode orig_rq->q and orig_rq->q->tag_set->driver_data
on its own, eliminating the need for the first two arguments.

Similarly, in the unprepare_flush_rq function in patch 2,
which is not provided by the loop driver in this patch:

+   if (q->mq_ops->unprepare_flush_rq)
+   q->mq_ops->unprepare_flush_rq(
+   q->tag_set->driver_data,
+   q, q->flush_rq);

If q is always passed, then that function could calculate
q->tag_set->driver_data and q->flush_rq itself, eliminating
two arguments.

> +
> +static int loop_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
> + unsigned int index)
> +{
> + hctx->driver_data = kmalloc(sizeof(struct loop_hctx_data),
> + GFP_KERNEL);

hctx->numa_node has been set before this; how about passing it 
along to kmalloc_node in case it has a useful value?

blk_mq_init_hw_queues sets it before calling this function:
	node = hctx->numa_node;
	if (node == NUMA_NO_NODE)
		node = hctx->numa_node = set->numa_node;

RE: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-25 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Benjamin LaHaise
> Sent: Friday, 22 August, 2014 11:27 AM
...
> Ah, that was missing a hunk then.  Try this version instead.
> 
...
> diff --git a/fs/aio.c b/fs/aio.c
> index ae63587..fbdcc47 100644

Using this version of the patch, I ran into this crash after 36
hours of scsi-mq testing over the weekend.

The test was running heavy traffic to four scsi-mq based devices:
* fio running 4 KiB random reads 
  * ioengine=libaio
  * not using userspace_reap=1
* mkfs.ext4 and e2fsck, generating huge write bursts

io_submit_one triggered an NMI:
[132204.801834] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
cpu 0
[132204.804503] CPU: 0 PID: 8998 Comm: fio Tainted: GE  3.17.0-rc1+ 
#15
[132204.806998] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 09/08/2013
[132204.809292]   88043f407bb8 815af54f 
0001
[132204.812076]  81809fe8 88043f407c38 815af2ce 
0010
[132204.814858]  88043f407c48 88043f407be8  

[132204.817557] Call Trace: 
[132204.818442][] dump_stack+0x49/0x62
[132204.820538]  [] panic+0xbb/0x1f8
[132204.822284]  [] watchdog_overflow_callback+0xb1/0xc0
[132204.824512]  [] __perf_event_overflow+0x98/0x230
[132204.826603]  [] perf_event_overflow+0x14/0x20
[132204.828618]  [] intel_pmu_handle_irq+0x1ec/0x3c0
[132204.830819]  [] perf_event_nmi_handler+0x34/0x60
[132204.832933]  [] nmi_handle+0x87/0x120
[132204.834748]  [] default_do_nmi+0x54/0x110
[132204.836670]  [] do_nmi+0x90/0xe0
[132204.838347]  [] end_repeat_nmi+0x1e/0x2e
[132204.840248]  [] ? io_submit_one+0x174/0x4b0
[132204.842293]  [] ? io_submit_one+0x174/0x4b0
[132204.844257]  [] ? io_submit_one+0x174/0x4b0
[132204.846161]  <>  [] do_io_submit+0x13c/0x200
[132204.848438]  [] ? pick_next_task_fair+0x163/0x220
[132204.850642]  [] SyS_io_submit+0x10/0x20
[132204.852519]  [] system_call_fastpath+0x16/0x1b
[132204.854608] Kernel Offset: 0x0 from 0x8100 (relocation range: 
0x8000-0x9fff)
[132204.857976] ---[ end Kernel panic - not syncing: Watchdog detected hard 
LOCKUP on cpu 0

io_submit_one+0x174 is offset 0x1f94.  Per objdump -drS aio.o,
that's code from put_reqs_available inlined into refill_reqs_available
inlined into io_submit_one.

...
 * Atomically adds @i to @v.
 */
static inline void atomic_add(int i, atomic_t *v)
{
asm volatile(LOCK_PREFIX "addl %1,%0"
1f8c:   41 8b 44 24 78  mov0x78(%r12),%eax
1f91:   f0 01 03lock add %eax,(%rbx)

local_irq_save(flags);
kcpu = this_cpu_ptr(ctx->cpu);
kcpu->reqs_available += nr;

while (kcpu->reqs_available >= ctx->req_batch * 2) {
1f94:   8b 01   mov(%rcx),%eax
1f96:   41 8b 54 24 78  mov0x78(%r12),%edx
1f9b:   8d 34 12lea(%rdx,%rdx,1),%esi
1f9e:   39 f0   cmp%esi,%eax
1fa0:   73 e6   jae1f88 


---
Rob Elliott    HP Server Storage





RE: [PATCH] checkpatch: look for common misspellings

2014-08-22 Thread Elliott, Robert (Server Storage)
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Joe Perches
...
> +chang||change

Although there are at least 2 misuses in the current kernel,
that's also a fairly common name.



RE: [RFC PATCH -logging 00/10] scsi/constants: Output continuous error messages on trace

2014-08-21 Thread Elliott, Robert (Server Storage)
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Yoshihiro YUNOMAE
> Sent: Friday, 08 August, 2014 6:50 AM
> Subject: [RFC PATCH -logging 00/10] scsi/constants: Output continuous
> error messages on trace
...
> 1) printk
> Keeps current implemntation of upstream kernel.
> The messages are divided and can be mixed, but all users can
> check the error messages without any settings.

scsi_io_completion ignores the scsi_logging_level and always calls
printk when it detects ACTION_FAIL, resulting in messages like:

[10240.338600] sd 2:0:0:0: [sdr]
[10240.339722] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[10240.341662] sd 2:0:0:0: [sdr]
[10240.342792] Sense Key : Hardware Error [current]
[10240.344575] sd 2:0:0:0: [sdr]
[10240.345653] Add. Sense: Logical unit failure
[10240.347138] sd 2:0:0:0: [sdr] CDB:
[10240.348309] Read(10): 28 00 00 00 00 80 00 00 08 00

If you trigger hundreds of errors (e.g., hot remove a device
during heavy IO), then all the prints to the linux serial console
bog down the system, causing timeouts in commands to other
devices and soft lockups for applications.

Some changes that would help are:
1. Put them under SCSI logging level control
2. Use printk_ratelimited so an excessive number are trimmed

Would you like to include something like this in your
patch set?

This is an example patch that only prints them if the MLCOMPLETE 
logging level is nonzero.
Off: scsi_logging_level --set --mlcomplete=0
On: scsi_logging_level --set --mlcomplete=1

Some other loglevel (e.g., ERROR_RECOVERY) could be used.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d6b4ea8..dbb601f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1037,7 +1037,9 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
switch (action) {
case ACTION_FAIL:
/* Give up and fail the remainder of the request */
-   if (!(req->cmd_flags & REQ_QUIET)) {
+   if (!(req->cmd_flags & REQ_QUIET) &&
+   SCSI_LOG_LEVEL(SCSI_LOG_MLCOMPLETE_SHIFT,
+   SCSI_LOG_MLCOMPLETE_BITS)) {
scsi_print_result(cmd);
if (driver_byte(result) & DRIVER_SENSE)
scsi_print_sense("", cmd);

Converting to printk_ratelimited is harder since the prints
are spread out over three functions (and as your patch
series notes, many individual printk calls).  The rates
for the printk calls might not match, which would lead to
even more confusing output.

---
Rob Elliott    HP Server Storage





RE: [RFC PATCH 01/10] scsi/constants: Cleanup printk message in __scsi_print_sense()

2014-08-12 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Yoshihiro YUNOMAE
> Sent: Friday, 08 August, 2014 6:50 AM
...
> Subject: [RFC PATCH 01/10] scsi/constants: Cleanup printk message in
> __scsi_print_sense()
> 
> A device name is output like "sda: Sense Key : Medium Error [current]"
> in __scsi_print_sense(), but it should be "[sda] Sense Key : Medium Error
> [current]" because other printk messages output a device name like "[sda]
> foo."
> 
...
> diff --git a/drivers/scsi/constants.c b/drivers/scsi/constants.c
> index c6a7a4a..a0e8159 100644
> --- a/drivers/scsi/constants.c
> +++ b/drivers/scsi/constants.c
> @@ -1470,7 +1470,7 @@ void __scsi_print_sense(struct scsi_device *sdev,
> const char *name,
>   return;
>   }
> 
> - sdev_printk(KERN_INFO, sdev, "%s: Sense Key : %s %s%s\n", name,
> + sdev_printk(KERN_INFO, sdev, "[%s] Sense Key : %s %s%s\n", name,
>  scsi_sense_key_string(sshdr.sense_key),
>  scsi_sense_type_string(&sshdr),
>  scsi_sense_format_string(&sshdr));
> 


The callers of __scsi_print_sense do not always pass in a name
like "sda".  In fact, sd.c doesn't even call that function; its
sd_print_sense_hdr calls sd_printk (which prints name as "[%s]")
and scsi_show_sense_hdr.

There are just 3 kernel files that call this function:
osst.c: __scsi_print_sense("osst ", SRpnt->sense, 
SCSI_SENSE_BUFFERSIZE);
osst.c: __scsi_print_sense("osst ", SRpnt->sense, 
SCSI_SENSE_BUFFERSIZE);
There is a separate call to printk just before each of those,
which prints out the name, without [].

sg.c:   __scsi_print_sense(__func__, sense,
That's printing the C function name.

st.c:__scsi_print_sense(name, SRpnt->sense, 
SCSI_SENSE_BUFFERSIZE);
st.c:   __scsi_print_sense(name, SRpnt->sense, 
SCSI_SENSE_BUFFERSIZE);
This is more like what you have in mind.


---
Rob Elliott    HP Server Storage



RE: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout

2014-07-18 Thread Elliott, Robert (Server Storage)


> From: James Bottomley [mailto:jbottom...@parallels.com]
> 
> On Fri, 2014-07-18 at 00:51 +, Elliott, Robert (Server Storage)
> wrote:
...
> >
> > Also, in both sd_setup_flush_cmnd and sd_sync_cache:
> >   cmd->cmnd[0] = SYNCHRONIZE_CACHE;
> >   cmd->cmd_len = 10;
> >
> > SYNCHRONIZE CACHE (16) should be favored over SYNCHRONIZE
> > CACHE (10) unless SYNCHRONIZE CACHE (10) is not supported.

(sorry - meant "unless ... 16 is not supported")

> For what reason.  We usually go for the safe alternatives, which is 10
> byte commands because they have the widest testing and greatest level of
> support.  We don't do range flushes currently, so there doesn't seem to
> be a practical different.  If we did support range flushes, we'd likely
> only use SC(16) on >2TB devices.
> 
> James

A goal of the simplified SCSI feature set idea is to drop all the
short CDBs that have larger, more capable equivalents - don't carry
READ 6/10/12/16 and SYNCHRONIZE CACHE 10/16, just keep the 16-byte 
versions.  With modern serial IU-based protocols, short CDBs don't 
save any transfer time.  This will simplify design and testing on
both initiator and target sides. Competing command sets like NVMe 
rightly point out that SCSI has too much legacy baggage - all you 
need for IO is one READ, one WRITE, and one FLUSH command.

That's why SBC-3 ended up with these warning notes for all the
non-16 byte CDBs:

NOTE 15 - Migration from the SYNCHRONIZE CACHE (10) command to
the SYNCHRONIZE CACHE (16) command is recommended for all
implementations.

If the LBA field in SYNCHRONIZE CACHE went obsolete, then maybe
SYNCHRONIZE CACHE (10) would be kept instead of (16), but that
field is still present.  So, (16) is the likely survivor.

---
Rob Elliott    HP Server Storage




RE: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout

2014-07-17 Thread Elliott, Robert (Server Storage)
In sd_sync_cache:
rq->timeout *= SD_FLUSH_TIMEOUT_MULTIPLIER;

Regardless of the baseline for the multiplication, a magic 
number of 2 is too arbitrary.  That might work for an
individual drive, but could be far too short for a RAID
controller that runs into worst case error handling for
the drives to which it is flushing (e.g., if its cache
is volatile and the drives all have recoverable errors
during writes).  That time goes up with a bigger cache and 
with more fragmented write data in the cache requiring
more individual WRITE commands.

A better value would be the Recommended Command Timeout field 
value reported in the REPORT SUPPORTED OPERATION CODES command,
if reported by the device server.  That is supposed to account
for the worst case.

For cases where that is not reported, exposing the multiplier
in sysfs would let the timeout for simple devices be set to
smaller values than complex devices.

Also, in both sd_setup_flush_cmnd and sd_sync_cache:
  cmd->cmnd[0] = SYNCHRONIZE_CACHE;
  cmd->cmd_len = 10;

SYNCHRONIZE CACHE (16) should be favored over SYNCHRONIZE 
CACHE (10) unless SYNCHRONIZE CACHE (10) is not supported.

---
Rob Elliott    HP Server Storage





RE: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16

2014-07-16 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of James Bottomley
> Sent: Wednesday, 16 July, 2014 1:02 PM
> To: martin.peter...@oracle.com
> Cc: linux-kernel@vger.kernel.org; h...@infradead.org;
> de...@linuxdriverproject.org; a...@canonical.com; k...@microsoft.com;
> sta...@vger.kernel.org; linux-s...@vger.kernel.org; oher...@suse.com;
> jasow...@redhat.com
> Subject: Re: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16
> 
> On Wed, 2014-07-16 at 13:47 -0400, Martin K. Petersen wrote:
> > > "Christoph" == hch@infradead org  writes:
> >
> > Christoph> Oh, we actually have devices that support WRITE SAME with
> > Christoph> unmap, but not without?  That's defintively a little strange.
> >
> > Yep :(
> >
> > There were several SSDs that did not want to support wearing out flash
> > by writing gobs of zeroes and only support the UNMAP case.
> >
> > Christoph> Yes, and it did this intentionally.  I really wouldn't expect
> > Christoph> devices to support WRITE SAME with UNMAP but blow up on a
> > Christoph> WRITE SAME without it (and not just simple fail it in an
> > Christoph> orderly way).
> >
> > *sigh*
> >
> > Christoph> It definitively seems odd to default to trying WRITE SAME for
> > Christoph> unmap for a device that explicitly tells us that it doesn't
> > Christoph> support WRITE SAME.
> >
> > Maybe it's just a naming thing. I was really trying to convey
> > no_req_write_same support, not no_write_same_10_or_16.
> >
> > Christoph> Note that I'm not against your patch - I suspect forcing us
> > Christoph> to read EVPD pages even for devices that claim to be SPC-2
> > Christoph> will come in useful in various scenarios.
> >
> > I don't have a problem with a BLIST_PREFER_UNMAP flag or something like
> > that. But BLIST_TRY_VPD_PAGES seems more generally useful and it does
> > fix the problem at hand. That's why I went that route.
> 
> Hang on ... unless we apply Christoph or my fix, we'll get the same
> issue with every raid driver (that's about 10 I think) that set
> no_write_same when they hit a >2TB RAID volume, so I think we need both
> fixes.
> 
> James
> 

WRITE SAME with the UNMAP bit set to one (and a few other
conditions) guarantees that the data is zeroed out, while
the UNMAP command is just a hint.  They're not fully
interchangeable.  Which semantics are implied by REQ_DISCARD
and these functions?

One benefit of UNMAP is that UNMAP supports a list of 
discontiguous LBA ranges, whereas WRITE SAME just supports 
one LBA range.  sd_setup_discard_cmnd is not taking
advantage of this feature, though.  Ideally, the block
layer would merge multiple discards into one UNMAP command 
if they're stuck on the request queue for a while, like 
it merges adjacent reads and writes.  That would pave the way
for building up WRITE SCATTERED and READ GATHERED commands.




[PATCH] block: make blkdev_issue_flush check for NULL from bio_alloc

2014-07-14 Thread Elliott, Robert (Server Storage)
bio_alloc can return NULL if gfp_mask does not contain __GPF_WAIT
and it doesn't use a mempool.  Since blkdev_issue_flush passes
along gfp_mask, it needs to check if bio_alloc returns NULL.

Signed-off-by: Robert Elliott 
---
 block/blk-flush.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 3cb5e9e..2a9fd20 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -457,6 +457,8 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
return -ENXIO;
 
bio = bio_alloc(gfp_mask, 0);
+   if (!bio)
+   return -ENOMEM;
bio->bi_bdev = bdev;
 
ret = submit_bio_wait(WRITE_FLUSH, bio);

---
Rob Elliott    HP Server Storage





RE: scsi-mq V2

2014-07-13 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Elliott, Robert (Server Storage)
> Sent: Saturday, July 12, 2014 6:20 PM
> To: Benjamin LaHaise
> Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; dgilb...@interlog.com;
> James Bottomley; Bart Van Assche; linux-s...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: RE: scsi-mq V2
> 
> > I will see if that solves the problem with the scsi-mq-3 tree, or
> > at least some of the bisect trees leading up to it.
> 
> scsi-mq-3 is still going after 45 minutes.  I'll leave it running
> overnight.
> 

That has been going strong for 18 hours, so I think that's the patch
we need.





RE: [RESEND][PATCH 06/10][SCSI]mpt2sas: For >2TB volumes, DirectDrive support sends IO's with LBA bit 31 to IR FW instead of DirectDrive

2014-07-13 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Martin K. Petersen


...
> Also, this is not touched by the patch, but you're then doing:
> 
> (*(__be32 *)(&cdb_ptr[i])) = cpu_to_be32(p_lba);
> 
> What if this is a 6-byte READ/WRITE command? You'll end up exceeding the
> size of the LBA field.

All this is inside:
if (cdb0 == READ_16 || cdb0 == READ_10 ||
cdb0 == WRITE_16 || cdb0 == WRITE_10) {

so READ_6 and WRITE_6 and all their oddities are not a problem here.





RE: scsi-mq V2

2014-07-12 Thread Elliott, Robert (Server Storage)
> I will see if that solves the problem with the scsi-mq-3 tree, or
> at least some of the bisect trees leading up to it.

scsi-mq-3 is still going after 45 minutes.  I'll leave it running
overnight.




RE: scsi-mq V2

2014-07-12 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Benjamin LaHaise [mailto:b...@kvack.org]
> Sent: Friday, 11 July, 2014 9:55 AM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; dgilb...@interlog.com; James
> Bottomley; Bart Van Assche; linux-s...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: scsi-mq V2
...
> Can you try the below totally untested patch instead?  It looks like
> put_reqs_available() is not irq-safe.
> 

With that addition alone, fio still runs into the same problem.

I added the same fix to get_reqs_available, which also accesses 
kcpu->reqs_available, and the test has run for 35 minutes with 
no problem.

Patch applied:

diff --git a/fs/aio.c b/fs/aio.c
index e59bba8..8e85e26 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
struct kioctx_cpu *kcpu;
+   unsigned long flags;
 
preempt_disable();
kcpu = this_cpu_ptr(ctx->cpu);
 
+   local_irq_save(flags);
kcpu->reqs_available += nr;
+
while (kcpu->reqs_available >= ctx->req_batch * 2) {
kcpu->reqs_available -= ctx->req_batch;
atomic_add(ctx->req_batch, &ctx->reqs_available);
}
 
+   local_irq_restore(flags);
preempt_enable();
 }
 
@@ -847,10 +851,12 @@ static bool get_reqs_available(struct kioctx *ctx)
 {
struct kioctx_cpu *kcpu;
bool ret = false;
+   unsigned long flags;
 
preempt_disable();
kcpu = this_cpu_ptr(ctx->cpu);
 
+   local_irq_save(flags);
if (!kcpu->reqs_available) {
int old, avail = atomic_read(&ctx->reqs_available);
 
@@ -869,6 +875,7 @@ static bool get_reqs_available(struct kioctx *ctx)
ret = true;
kcpu->reqs_available--;
 out:
+   local_irq_restore(flags);
preempt_enable();
return ret;
 }

--
I will see if that solves the problem with the scsi-mq-3 tree, or 
at least some of the bisect trees leading up to it.

A few other comments:

1. Those changes boost _raw_spin_lock_irqsave into first place
in perf top:

  6.59%  [kernel][k] _raw_spin_lock_irqsave
  4.37%  [kernel][k] put_compound_page
  2.87%  [scsi_debug][k] sdebug_q_cmd_hrt_complete
  2.74%  [kernel][k] _raw_spin_lock
  2.73%  [kernel][k] apic_timer_interrupt
  2.41%  [kernel][k] do_blockdev_direct_IO
  2.24%  [kernel][k] __get_page_tail
  1.97%  [kernel][k] _raw_spin_unlock_irqrestore
  1.87%  [kernel][k] scsi_queue_rq
  1.76%  [scsi_debug][k] schedule_resp

Maybe (later) kcpu->reqs_available should be converted to an atomic,
like ctx->reqs_available, to reduce that overhead?

2. After the f8567a3 patch, aio_complete has one early return that 
bypasses the call to put_reqs_available.  Is that OK, or does
that mean that sync iocbs will now eat up reqs_available?

/*
 * Special case handling for sync iocbs:
 *  - events go directly into the iocb for fast handling
 *  - the sync task with the iocb in its stack holds the single iocb
 *ref, no other paths have a way to get another ref
 *  - the sync task helpfully left a reference to itself in the iocb
 */
if (is_sync_kiocb(iocb)) {
iocb->ki_user_data = res;
smp_wmb();
iocb->ki_ctx = ERR_PTR(-EXDEV);
wake_up_process(iocb->ki_obj.tsk);
return;
}


3. The f8567a3 patch renders this comment in aio.c out of date - it's 
no longer incremented when pulled off the ringbuffer, but is now 
incremented when aio_complete is called.

struct {
/*
 * This counts the number of available slots in the ringbuffer,
 * so we avoid overflowing it: it's decremented (if positive)
 * when allocating a kiocb and incremented when the resulting
 * io_event is pulled off the ringbuffer.
 *
 * We batch accesses to it with a percpu version.
 */
atomic_t	reqs_available;
} cacheline_aligned_in_smp;


---
Rob Elliott    HP Server Storage





RE: mpt2sas stuck installing

2014-07-11 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Joe Lawrence
...
> In your crash stack trace, the scsi error handler has issued a host
> reset, but then crashed in mpt2sas_base_get_iocstate.  Reading through
> _scsih_shutdown, I don't believe that the mpt2sas .shutdown path takes
> any precaution before heading down mpt2sas_base_detach to free adapter
> resources.  The ordinary .remove path looks similar, though it does
> call sas_remove_host before freeing resources, *then* scsi_remove_host
> and scsi_host_put.
> 
> I wonder if this ordering needs to be reversed (and added to
> _scsih_shutdown) to properly de-register from the SCSI midlayer prior
> to removing the controller instance.
> 
> Regards,
> 
> -- Joe

Nagalakshmi was working on an mpt3sas patch for the scsi-mq tree 
to do just that.  I don't recall if the patch has made it into the 
scsi-mq tree yet. Apparently it's also needed for non-mq and mpt2sas.

It is making this change:
>   sas_remove_host(shost);
> + scsi_remove_host(shost);
>   mpt3sas_base_detach(ioc);
>   list_del(&ioc->list);
> - scsi_remove_host(shost);
>   scsi_host_put(shost);

We are making a similar change in hpsa.  Doing so led to a general 
protection fault, which revealed that we also needed to change 
cancel_delayed_work() calls to cancel_delayed_work_sync() to 
ensure there are no worker functions still active after the 
scsi_host structure is unregistered.




RE: scsi-mq V2

2014-07-11 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Friday, 11 July, 2014 1:15 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jeff Moyer; Christoph Hellwig; Jens Axboe; dgilb...@interlog.com; James
> Bottomley; Bart Van Assche; Benjamin LaHaise; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On Fri, Jul 11, 2014 at 06:02:03AM +, Elliott, Robert (Server Storage)
> wrote:
> > Allowing longer run times before declaring success, the problem
> > does appear in all of the bisect trees.  I just let fio
> > continue to run for many minutes - no ^Cs necessary.
> >
> > no-rebase: good for > 45 minutes (I will leave that running for
> >   8 more hours)
> 
> Ok, thanks.  If it's still running tomorrow morning let's look into the
> aio reverts again.

That ran 9 total hours with no problem.

Rather than revert in the bisect trees, I added just this single additional
patch to the no-rebase tree, and the problem appeared:


48a2e94154177286b3bcbed25ea802232527fa7c
aio: fix aio request leak when events are reaped by userspace

diff --git a/fs/aio.c b/fs/aio.c
index 4f078c0..e59bba8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1021,6 +1021,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

/* everything turned out well, dispose of the aiocb. */
kiocb_free(iocb);
+   put_reqs_available(ctx, 1); /* added by patch f8567 */

/*
 * We have to order our ring_info tail store above and test
@@ -1101,7 +1102,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

pr_debug("%li  h%u t%u\n", ret, head, tail);

-   put_reqs_available(ctx, ret);
+   /* put_reqs_available(ctx, ret); removed by patch f8567 */
 out:
mutex_unlock(&ctx->ring_lock);


---
Rob Elliott    HP Server Storage





RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Elliott, Robert (Server Storage)
> 
> I added some prints in aio_setup_ring and  ioctx_alloc and
> rebooted.  This time it took much longer to hit the problem.  It
> survived dozens of ^Cs.  Running a few minutes, though, IOPS
> eventually dropped.  So, sometimes it happens immediately,
> sometimes it takes time to develop.
> 
> I will rerun bisect-1 -2 and -3 for longer times to increase
> confidence that they didn't just appear good.

Allowing longer run times before declaring success, the problem 
does appear in all of the bisect trees.  I just let fio
continue to run for many minutes - no ^Cs necessary.

no-rebase: good for > 45 minutes (I will leave that running for
  8 more hours)
bisect-1: bad
bisect-2: bad
bisect-3: bad
bisect-4: bad




RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Jeff Moyer [mailto:jmo...@redhat.com]
> Sent: Thursday, 10 July, 2014 2:14 PM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; Jens Axboe; dgilb...@interlog.com; James Bottomley;
> Bart Van Assche; Benjamin LaHaise; linux-s...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> "Elliott, Robert (Server Storage)"  writes:
> 
> >> -Original Message-
> >> From: Christoph Hellwig [mailto:h...@infradead.org]
> >> Sent: Thursday, 10 July, 2014 11:15 AM
> >> To: Elliott, Robert (Server Storage)
> >> Cc: Jens Axboe; dgilb...@interlog.com; James Bottomley; Bart Van Assche;
> >> Benjamin LaHaise; linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> >> Subject: Re: scsi-mq V2
> >>
> >> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> >> > It's starting to look weird.  I'll prepare another two bisect branches
> >> > around some MM changes, which seems the only other possible candidate.
> >>
> >> I've pushed out scsi-mq.3-bisect-3
> >
> > Good.
> >
> >> and scsi-mq.3-bisect-4 for you.
> >
> > Bad.
> >
> > Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
> > diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
> > index df95a2f..11b65d4 100644
> > --- a/arch/x86/vdso/vdso2c.h
> > +++ b/arch/x86/vdso/vdso2c.h
> > @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct
> BITSFUNC(fake_sections) *out,
> > uint64_t flags = GET_LE(&in->sh_flags);
> >
> > bool copy = flags & SHF_ALLOC &&
> > +   (GET_LE(&in->sh_size) ||
> > +   (GET_LE(&in->sh_type) != SHT_RELA &&
> > +   GET_LE(&in->sh_type) != SHT_REL)) &&
> > strcmp(name, ".altinstructions") &&
> > strcmp(name, ".altinstr_replacement");
> >
> > Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
> > an fio hang, with one CPU (CPU 0) stuck in io_submit loops.
> 

I added some prints in aio_setup_ring and  ioctx_alloc and
rebooted.  This time it took much longer to hit the problem.  It 
survived dozens of ^Cs.  Running a few minutes, though, IOPS 
eventually dropped.  So, sometimes it happens immediately,
sometimes it takes time to develop.

I will rerun bisect-1 -2 and -3 for longer times to increase
confidence that they didn't just appear good.

On this bisect-4 run, as IOPS started to drop from 900K to 40K, 
I ran perf top when it was at 700K.  You can see io_submit times
creeping up.

  4.30%  [kernel][k] do_io_submit
  4.29%  [kernel][k] _raw_spin_lock_irqsave
  3.88%  libaio.so.1.0.1 [.] io_submit
  3.55%  [kernel][k] system_call
  3.34%  [kernel][k] put_compound_page
  3.11%  [kernel][k] io_submit_one
  3.06%  [kernel][k] system_call_after_swapgs
  2.89%  [kernel][k] copy_user_generic_string
  2.45%  [kernel][k] lookup_ioctx
  2.16%  [kernel][k] apic_timer_interrupt
  2.00%  [kernel][k] _raw_spin_lock
  1.97%  [scsi_debug][k] sdebug_q_cmd_hrt_complete
  1.84%  [kernel][k] __get_page_tail
  1.74%  [kernel][k] do_blockdev_direct_IO
  1.68%  [kernel][k] blk_flush_plug_list
  1.41%  [kernel][k] _raw_spin_unlock_irqrestore
  1.24%  [scsi_debug][k] schedule_resp

finally settling like before:
 14.15%  [kernel][k] do_io_submit
 13.61%  libaio.so.1.0.1 [.] io_submit
 11.81%  [kernel][k] system_call
 10.11%  [kernel][k] system_call_after_swapgs
  8.59%  [kernel][k] io_submit_one
  8.56%  [kernel][k] copy_user_generic_string
  7.96%  [kernel][k] lookup_ioctx
  5.33%  [kernel][k] blk_flush_plug_list
  3.11%  [kernel][k] blk_finish_plug
  2.84%  [kernel][k] sysret_check
  2.63%  fio [.] fio_libaio_commit
  2.27%  [kernel][k] blk_start_plug
  1.17%  [kernel][k] SyS_io_submit



RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Thursday, 10 July, 2014 11:15 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jens Axboe; dgilb...@interlog.com; James Bottomley; Bart Van Assche;
> Benjamin LaHaise; linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> > It's starting to look weird.  I'll prepare another two bisect branches
> > around some MM changes, which seems the only other possible candidate.
> 
> I've pushed out scsi-mq.3-bisect-3 

Good.

> and scsi-mq.3-bisect-4 for you.

Bad.

Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index df95a2f..11b65d4 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
uint64_t flags = GET_LE(&in->sh_flags);

bool copy = flags & SHF_ALLOC &&
+   (GET_LE(&in->sh_size) ||
+   (GET_LE(&in->sh_type) != SHT_RELA &&
+   GET_LE(&in->sh_type) != SHT_REL)) &&
strcmp(name, ".altinstructions") &&
strcmp(name, ".altinstr_replacement");

Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

perf top shows lookup_ioctx function alongside io_submit and
do_io_submit this time:
 14.96%  [kernel] [k] lookup_ioctx
 14.71%  libaio.so.1.0.1  [.] io_submit
 13.78%  [kernel] [k] system_call
 10.79%  [kernel] [k] system_call_after_swapgs
 10.17%  [kernel] [k] do_io_submit
  8.91%  [kernel] [k] copy_user_generic_string
  4.24%  [kernel] [k] io_submit_one
  3.93%  [kernel] [k] blk_flush_plug_list
  3.32%  fio  [.] fio_libaio_commit
  2.84%  [kernel] [k] sysret_check
  2.06%  [kernel] [k] blk_finish_plug
  1.89%  [kernel] [k] SyS_io_submit
  1.48%  [kernel] [k] blk_start_plug
  1.04%  fio  [.] io_submit@plt
  0.84%  [kernel] [k] __get_user_4
  0.74%  [kernel] [k] system_call_fastpath
  0.60%  [kernel] [k] _copy_from_user
  0.51%  diff [.] 0x7abb

ftrace on CPU 0 shows similar repetition to before:
 fio-4107  [000]    389.992300: lookup_ioctx <-do_io_submit
 fio-4107  [000]    389.992300: blk_start_plug <-do_io_submit
 fio-4107  [000]    389.992300: io_submit_one <-do_io_submit
 fio-4107  [000]    389.992300: blk_finish_plug <-do_io_submit
 fio-4107  [000]    389.992300: blk_flush_plug_list <-blk_finish_plug
 fio-4107  [000]    389.992301: SyS_io_submit <-system_call_fastpath
 fio-4107  [000]    389.992301: do_io_submit <-SyS_io_submit
 fio-4107  [000]    389.992301: lookup_ioctx <-do_io_submit
 fio-4107  [000]    389.992301: blk_start_plug <-do_io_submit
 fio-4107  [000]    389.992301: io_submit_one <-do_io_submit
 fio-4107  [000]    389.992301: blk_finish_plug <-do_io_submit
 fio-4107  [000]    389.992301: blk_flush_plug_list <-blk_finish_plug
 fio-4107  [000]    389.992301: SyS_io_submit <-system_call_fastpath
 fio-4107  [000]    389.992302: do_io_submit <-SyS_io_submit
 fio-4107  [000]    389.992302: lookup_ioctx <-do_io_submit
 fio-4107  [000]    389.992302: blk_start_plug <-do_io_submit
 fio-4107  [000]    389.992302: io_submit_one <-do_io_submit
 fio-4107  [000]    389.992302: blk_finish_plug <-do_io_submit
 fio-4107  [000]    389.992302: blk_flush_plug_list <-blk_finish_plug





RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Thursday, 10 July, 2014 1:21 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jens Axboe; dgilb...@interlog.com; Christoph Hellwig; James Bottomley;
> Bart Van Assche; Benjamin LaHaise; linux-s...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On Thu, Jul 10, 2014 at 12:53:36AM +, Elliott, Robert (Server Storage)
> wrote:
> > the problem still occurs - fio results in low or 0 IOPS, with perf top
> > reporting unusual amounts of time spent in do_io_submit and io_submit.
> 
> The diff between the two version doesn't show too much other possible
> interesting commits, the most interesting being some minor block
> updates.
> 
> I guess we'll have to a manual bisect, I've pushed out a

> scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
> the block tree

good.

> and a scsi-mq.3-bisect-2 branch that is just after the merge of the 
> block tree to get started.

good.




RE: scsi-mq V2

2014-07-10 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Thursday, 10 July, 2014 8:53 AM
> To: Christoph Hellwig; Benjamin LaHaise
> Cc: Elliott, Robert (Server Storage); dgilb...@interlog.com; James Bottomley;
> Bart Van Assche; linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On 2014-07-10 15:50, Christoph Hellwig wrote:
> > On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> >> There is one possible concern that could be exacerbated by other changes
> in
> >> the system: if the application is running close to the bare minimum number
> >> of requests allocated in io_setup(), the per cpu reference counters will
> >> have a hard time batching things.  It might be worth testing with an
> >> increased number of requests being allocated if this is the case.
> >
> > Well, Robert said reverting the two aio commits didn't help.  Either he
> > didn't manage to boot into the right kernel, or we need to look
> > elsewhere for the culprit.
> 
> Rob, let me know what scsi_debug setup you use, and I can try and
> reproduce it here as well.
> 
> --
> Jens Axboe

This system has 6 online CPUs and 64 possible CPUs.

Printing avail and req_batch in that loop results in many of these:
** 3813 printk messages dropped ** [10643.503772] ctx 88042d8d4cc0 avail=0 req_batch=2

Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
those pr_debug prints results in nothing extra printing,
so it's not hitting an error.

Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
these as fio starts:

[  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.401491] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434522] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.434523] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434526] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434533] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.435370] hrtimer: interrupt took 6868 ns
[  186.435491] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.435492] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.447864] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.449896] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449900] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449901] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.449909] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.449932] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449933] ioctx

RE: scsi-mq V2

2014-07-09 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Wednesday, 09 July, 2014 2:38 PM
> To: dgilb...@interlog.com; Christoph Hellwig; James Bottomley; Bart Van
> Assche; Elliott, Robert (Server Storage); linux-s...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On 2014-07-09 18:39, Douglas Gilbert wrote:
> > On 14-07-08 10:48 AM, Christoph Hellwig wrote:
> >> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
> >>> Changes from V1:
> >>>   - rebased on top of the core-for-3.17 branch, most notable the
> >>> scsi logging changes
> >>>   - fixed handling of cmd_list to prevent crashes for some heavy
> >>> workloads
> >>>   - fixed incorrect handling of !target->can_queue
> >>>   - avoid scheduling a workqueue on I/O completions when no queues
> >>> are congested
> >>>
> >>> In addition to the patches in this thread there also is a git
> >>> available at:
> >>>
> >>> git://git.infradead.org/users/hch/scsi.git scsi-mq.2
> >>
> >>
> >> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> >> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> >> from June 29th.  Robert Elliot found a problem with not fully zeroed
> >> out UNMAP CDBs, which is fixed by the saner discard handling in that
> >> series.
> >>
> >> There is a new patch to factor the code from the above series for
> >> blk-mq use, which I've attached below.  Besides that the only changes
> >> are minor merge fixups in the main blk-mq usage patch.
> >
> > Be warned: both Rob Elliott and I can easily break
> > the scsi-mq.3 branch. It seems as though a regression
> > has slipped in. I notice that Christoph has added a
> > new branch called "scsi-mq.3-no-rebase".
> 
> Rob/Doug, those issues look very much like problems in the aio code. Can
> either/both of you try with:
> 
> f8567a3845ac05bb28f3c1b478ef752762bd39ef
> edfbbf388f293d70bf4b7c0bc38774d05e6f711a
> 
> reverted (in that order) and see if that changes anything.
> 
> 
> --
> Jens Axboe

scsi-mq.3-no-rebase, which has all the scsi updates from scsi-mq.3
but is based on 3.16.0-rc2 rather than 3.16.0-rc4, works fine:
* ^C exits fio cleanly with scsi_debug devices
* ^C exits fio cleanly with mpt3sas devices
* fio hits 1M IOPS with 16 hpsa devices
* fio hits 700K IOPS with 6 mpt3sas devices
* 38 device test to mpt3sas, hpsa, and scsi_debug devices runs OK


With:
* scsi-mq-3, which is based on 3.16.0-rc4
* [PATCH] x86-64: fix vDSO build from https://lkml.org/lkml/2014/7/3/738
* those two aio patches reverted

the problem still occurs - fio results in low or 0 IOPS, with perf top 
reporting unusual amounts of time spent in do_io_submit and io_submit.

perf top:
 14.38%  [kernel]  [k] do_io_submit
 13.71%  libaio.so.1.0.1   [.] io_submit
 13.32%  [kernel]  [k] system_call
 11.60%  [kernel]  [k] system_call_after_swapgs
  8.88%  [kernel]  [k] lookup_ioctx
  8.78%  [kernel]  [k] copy_user_generic_string
  7.78%  [kernel]  [k] io_submit_one
  5.97%  [kernel]  [k] blk_flush_plug_list
  2.73%  fio   [.] fio_libaio_commit
  2.70%  [kernel]  [k] sysret_check
  2.68%  [kernel]  [k] blk_finish_plug
  1.98%  [kernel]  [k] blk_start_plug
  1.17%  [kernel]  [k] SyS_io_submit
  1.17%  [kernel]  [k] __get_user_4
  0.99%  fio   [.] io_submit@plt
  0.85%  [kernel]  [k] _copy_from_user
  0.79%  [kernel]  [k] system_call_fastpath

Repeating some of last night's investigation details for the lists:

ftrace of one of the CPUs for all functions shows these 
are repeatedly being called:
 
   <...>-34508 [004]   6360.790714: io_submit_one <-do_io_submit
   <...>-34508 [004]   6360.790714: blk_finish_plug <-do_io_submit
   <...>-34508 [004]   6360.790714: blk_flush_plug_list <-blk_finish_plug
   <...>-34508 [004]   6360.790714: SyS_io_submit <-system_call_fastpath
   <...>-34508 [004]   6360.790715: do_io_submit <-SyS_io_submit
   <...>-34508 [004]   6360.790715: lookup_ioctx <-do_io_submit
   <...>-34508 [004]   6360.790715: blk_start_plug <-do_io_submit
   <...>-34508 [004]   6360.790715: io_submit_one <-do_io_submit
   <...>-34508 [004]   6360.790715: blk_finish_plug <-do_io_submit
   <...>-345

RE: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn

2014-07-08 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, 25 June, 2014 11:52 AM
> To: James Bottomley
> Cc: Jens Axboe; Bart Van Assche; Elliott, Robert (Server Storage); linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH 03/14] scsi: centralize command re-queueing in
> scsi_dispatch_fn
> 
> Make sure we only have the logic for requeing commands in one place.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/scsi.c |   35 ---
>  drivers/scsi/scsi_lib.c |9 ++---
>  2 files changed, 18 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index ce5b4e5..dcc43fd 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -648,9 +648,7 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>* returns an immediate error upwards, and signals
>* that the device is no longer present */
>   cmd->result = DID_NO_CONNECT << 16;
> - scsi_done(cmd);
> - /* return 0 (because the command has been processed) */
> - goto out;
> + goto done;
>   }
> 
>   /* Check to see if the scsi lld made this device blocked. */
> @@ -662,17 +660,9 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>* occur until the device transitions out of the
>* suspend state.
>*/
> -
> - scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
> -
>   SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
>   "queuecommand : device blocked\n"));
> -
> - /*
> -  * NOTE: rtn is still zero here because we don't need the
> -  * queue to be plugged on return (it's already stopped)
> -  */
> - goto out;
> + return SCSI_MLQUEUE_DEVICE_BUSY;
>   }
> 
>   /*
> @@ -696,20 +686,19 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>  "cdb_size=%d host->max_cmd_len=%d\n",
>  cmd->cmd_len, cmd->device->host->max_cmd_len));
>   cmd->result = (DID_ABORT << 16);
> -
> - scsi_done(cmd);
> - goto out;
> + goto done;
>   }
> 
>   if (unlikely(host->shost_state == SHOST_DEL)) {
>   cmd->result = (DID_NO_CONNECT << 16);
> - scsi_done(cmd);
> - } else {
> - trace_scsi_dispatch_cmd_start(cmd);
> - cmd->scsi_done = scsi_done;
> - rtn = host->hostt->queuecommand(host, cmd);
> + goto done;
> +
>   }
> 
> + trace_scsi_dispatch_cmd_start(cmd);
> +
> + cmd->scsi_done = scsi_done;
> + rtn = host->hostt->queuecommand(host, cmd);
>   if (rtn) {
>   trace_scsi_dispatch_cmd_error(cmd, rtn);
>   if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
> @@ -718,12 +707,12 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> 
>   SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
>   "queuecommand : request rejected\n"));
> -
> - scsi_queue_insert(cmd, rtn);
>   }
> 
> - out:
>   return rtn;
> + done:
> + scsi_done(cmd);
> + return 0;
>  }
> 

Related to the position of the trace_scsi_dispatch_cmd_start()
call... this function does:

1. check sdev_state - goto done
2. check scsi_device_blocked() - return
3. put LUN into CDB for ancient SCSI-1 devices
4. scsi_log_send()
5. check cmd_len - goto done
6. check shost_state - goto done
7. trace_scsi_dispatch_cmd_start()
8. queuecommand()
9. return
10. done:
cmd->scsi_done(cmd)  [PATCH 04/14 upgrades it to this]
return 0;

It's inconsistent for logging and tracing to occur after
different numbers of checks.

In scsi_lib.c, both scsi_done() and scsi_mq_done() always call
trace_scsi_dispatch_cmd_done(), so trace_scsi_dispatch_cmd_start()
should be called before scsi_done() is called.  That way the
trace will always have a submission to match each completion.

That means the trace should be called before the sdev_state check
(which calls scsi_done()).

I don't know about the scsi_device_blocked() check (which just
returns).  Should the trace record multiple submissions with
one completion?  Or should both trace_scsi_dispatch_cmd_start()
and trace_scsi_dispatch_cmd_done() be called?

scsi_log_completion() is called by scsi_softirq_done() and
scsi_times_out() but not by scsi_done() and scsi_mq_done(), so 
scsi_log_send() should not be called unless all t

RE: [PATCH] qla4xxx: Return -ENOMEM on memory allocation failure

2014-07-04 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Himangi Saraogi
> Sent: Friday, 04 July, 2014 1:28 PM
> To: Vikas Chaudhary; iscsi-dri...@qlogic.com; James E.J. Bottomley; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: julia.law...@lip6.fr
> Subject: [PATCH] qla4xxx: Return -ENOMEM on memory allocation failure
> 
> In this code, 0 is returned on memory allocation failure, even though
> other failures return -ENOMEM or other similar values.
> 
> A simplified version of the Coccinelle semantic match that finds this
> problem is as follows:
> 
> // 
> @@
> expression ret;
> expression x,e1,e2,e3;
> identifier alloc;
> @@
> 
> ret = 0
> ... when != ret = e1
> *x = alloc(...)
> ... when != ret = e2
> if (x == NULL) { ... when != ret = e3
>   return ret;
> }
> // 
> 
> Signed-off-by: Himangi Saraogi 
> Acked-by: Julia Lawall 
> ---
>  drivers/scsi/qla4xxx/ql4_os.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/scsi/qla4xxx/ql4_os.c b/drivers/scsi/qla4xxx/ql4_os.c
> index c5d9564..72ba671 100644
> --- a/drivers/scsi/qla4xxx/ql4_os.c
> +++ b/drivers/scsi/qla4xxx/ql4_os.c
> @@ -1050,6 +1050,7 @@ static int qla4xxx_get_host_stats(struct Scsi_Host
> *shost, char *buf, int len)
>   if (!ql_iscsi_stats) {
>   ql4_printk(KERN_ERR, ha,
>  "Unable to allocate memory for iscsi stats\n");
> + ret = -ENOMEM;
>   goto exit_host_stats;
>   }
> 

Also, the only caller of this function doesn't use the return 
value - it's overwritten by another function call:

drivers/scsi/scsi_transport_iscsi.c:
err = transport->get_host_stats(shost, buf, host_stats_size);

actual_size = nlmsg_total_size(sizeof(*ev) + host_stats_size);
skb_trim(skbhost_stats, NLMSG_ALIGN(actual_size));
nlhhost_stats->nlmsg_len = actual_size;

err = iscsi_multicast_skb(skbhost_stats, ISCSI_NL_GRP_ISCSID,
  GFP_KERNEL);


---
Rob Elliott    HP Server Storage



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: scsi-mq V2

2014-06-26 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Wednesday, 25 June, 2014 11:51 PM
> To: Christoph Hellwig; James Bottomley
> Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On 2014-06-25 10:51, Christoph Hellwig wrote:
> > This is the second post of the scsi-mq series.
> >
...
> >
> > Changes from V1:
> >   - rebased on top of the core-for-3.17 branch, most notable the
> > scsi logging changes
> >   - fixed handling of cmd_list to prevent crashes for some heavy
> > workloads
> >   - fixed incorrect handling of !target->can_queue
> >   - avoid scheduling a workqueue on I/O completions when no queues
> > are congested
> >
> > In addition to the patches in this thread there also is a git available at:
> >
> > git://git.infradead.org/users/hch/scsi.git scsi-mq.2
> 
> You can add my acked/reviewed-by to the series.
> 
> --
> Jens Axboe

Since March 20th (circa LSF-MM 2014) we've run many hours of tests
with hpsa and the scsi-mq tree.  We've also done a little bit of 
testing with mpt3sas and, in the last few days, scsi_debug.

Although there are certainly more problems to find and improvements
to be made, it's become quite stable.  It's even been used on the
boot drives of our test servers.

For the patches in scsi-mq.2 you may add:
Tested-by: Robert Elliott 


---
Rob Elliott    HP Server Storage





RE: [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit

2014-06-21 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Thursday, 12 June, 2014 8:49 AM
> To: James Bottomley
> Cc: Jens Axboe; Bart Van Assche; Elliott, Robert (Server Storage); linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH 10/14] scsi: only maintain target_blocked if the driver has a
> target queue limit
> 
> This saves us an atomic operation for each I/O submission and completion
> for the usual case where the driver doesn't set a per-target can_queue
> value.  Only a few iscsi hardware offload drivers set the per-target
> can_queue value at the moment.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/scsi_lib.c |   17 -
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 0e33dee..763b3c9 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
...

> @@ -1642,7 +1648,8 @@ static void scsi_request_fn(struct request_queue *q)
>   return;
> 
>   host_not_ready:
> - atomic_dec(&scsi_target(sdev)->target_busy);
> + if (&scsi_target(sdev)->can_queue > 0)
> + atomic_dec(&scsi_target(sdev)->target_busy);
>   not_ready:
>   /*
>* lock q, handle tag, requeue req, and decrement device_busy. We

There's an extra & in that if statement.

---
Rob Elliott    HP Server Storage





RE: scsi-mq

2014-06-20 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Bart Van Assche [mailto:bvanass...@acm.org]
> Sent: Wednesday, 18 June, 2014 2:09 AM
> To: Jens Axboe; Christoph Hellwig; James Bottomley
> Cc: Elliott, Robert (Server Storage); linux-s...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: scsi-mq
> 
...
> Hello Jens,
> 
> Fio reports the same queue depth for use_blk_mq=Y (mq below) and
> use_blk_mq=N (sq below), namely ">=64". However, the number of context
> switches differs significantly for the random read-write tests.
> 
...
> It seems like with the traditional SCSI mid-layer and block core (sq)
> that the number of context switches does not depend too much on the
> number of I/O operations but that for the multi-queue SCSI core there
> are a little bit more than two context switches per I/O in the
> particular test I ran. The "randrw" script I used for this test takes
> SCSI LUNs as arguments (/dev/sdX) and starts the fio tool as follows:

Some of those context switches might be from scsi_end_request(), 
which always schedules the scsi_requeue_run_queue() function via the
requeue_work workqueue for scsi-mq.  That causes lots of context 
switches from a busy application thread (e.g., fio) to a 
kworker thread.

As shown by ftrace:

 fio-19340 [005] dNh. 12067.908444: scsi_io_completion <-scsi_finish_command
 fio-19340 [005] dNh. 12067.908444: scsi_end_request <-scsi_io_completion
 fio-19340 [005] dNh. 12067.908444: blk_update_request <-scsi_end_request
 fio-19340 [005] dNh. 12067.908445: blk_account_io_completion <-blk_update_request
 fio-19340 [005] dNh. 12067.908445: scsi_mq_free_sgtables <-scsi_end_request
 fio-19340 [005] dNh. 12067.908445: scsi_free_sgtable <-scsi_mq_free_sgtables
 fio-19340 [005] dNh. 12067.908445: blk_account_io_done <-__blk_mq_end_io
 fio-19340 [005] dNh. 12067.908445: blk_mq_free_request <-__blk_mq_end_io
 fio-19340 [005] dNh. 12067.908446: blk_mq_map_queue <-blk_mq_free_request
 fio-19340 [005] dNh. 12067.908446: blk_mq_put_tag <-__blk_mq_free_request
 fio-19340 [005] .N.. 12067.908446: blkdev_direct_IO <-generic_file_direct_write
kworker/5:1H-3207 [005] 12067.908448: scsi_requeue_run_queue <-process_one_work
kworker/5:1H-3207 [005] 12067.908448: scsi_run_queue <-scsi_requeue_run_queue
kworker/5:1H-3207 [005] 12067.908448: blk_mq_start_stopped_hw_queues <-scsi_run_queue
 fio-19340 [005] 12067.908449: blk_start_plug <-do_blockdev_direct_IO
 fio-19340 [005] 12067.908449: blkdev_get_block <-do_direct_IO
 fio-19340 [005] 12067.908450: blk_throtl_bio <-generic_make_request_checks
 fio-19340 [005] 12067.908450: blk_sq_make_request <-generic_make_request
 fio-19340 [005] 12067.908450: blk_queue_bounce <-blk_sq_make_request
 fio-19340 [005] 12067.908450: blk_mq_map_request <-blk_sq_make_request
 fio-19340 [005] 12067.908451: blk_mq_queue_enter <-blk_mq_map_request
 fio-19340 [005] 12067.908451: blk_mq_map_queue <-blk_mq_map_request
 fio-19340 [005] 12067.908451: blk_mq_get_tag <-__blk_mq_alloc_request
 fio-19340 [005] 12067.908451: blk_mq_bio_to_request <-blk_sq_make_request
 fio-19340 [005] 12067.908451: blk_rq_bio_prep <-init_request_from_bio
 fio-19340 [005] 12067.908451: blk_recount_segments <-bio_phys_segments
 fio-19340 [005] 12067.908452: blk_account_io_start <-blk_mq_bio_to_request
 fio-19340 [005] 12067.908452: blk_mq_hctx_mark_pending <-__blk_mq_insert_request
 fio-19340 [005] 12067.908452: blk_mq_run_hw_queue <-blk_sq_make_request
 fio-19340 [005] 12067.908452: blk_mq_start_request <-__blk_mq_run_hw_queue

In one snapshot tracing just scsi_end_request() and
scsi_requeue_run_queue(), 30K scsi_end_request() calls yielded
20K scsi_requeue_run_queue() calls.

In this case, blk_mq_start_stopped_hw_queues() doesn't end up
doing anything since there aren't any stopped queues to restart 
(blk_mq_run_hw_queue() gets called a bit later during routine 
fio work); the context switch turned out to be a waste of time.  
If it did find a stopped queue, then it would call 
blk_mq_run_hw_queue() itself.

---
Rob Elliott    HP Server Storage



RE: scsi-mq

2014-06-18 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Tuesday, 17 June, 2014 10:45 PM
> To: Bart Van Assche; Christoph Hellwig; James Bottomley
> Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq
> 
> On 2014-06-17 07:27, Bart Van Assche wrote:
> > On 06/12/14 15:48, Christoph Hellwig wrote:
> >> Bart and Robert have helped with some very detailed measurements that they
> >> might be able to send in reply to this, although these usually involve
> >> significantly reworked low level drivers to avoid other bottle necks.
> >
> > In case someone would like to see the results of the measurements I ran,
> > these results can be found here:
> > https://docs.google.com/file/d/0B1YQOreL3_FxUXFMSjhmNDBNNTg.
> >
> > Two important conclusions from the data in that PDF document are as
> follows:
> > - A small but significant performance improvement for the traditional
> >SCSI mid-layer (use_blk_mq=N).
> > - A very significant performance improvement for multithreaded
> >workloads with use_blk_mq=Y. As an example, the number of I/O
> >operations per second reported for the random write test increased
> >with 170%. That means 2.7 times the performance
> >of use_blk_mq=N.
> 
> Thanks for posting these numbers, Bart. The CPU utilization and IOPS
> speak a very clear message. The only mystery is why the single-threaded
> performance is down. That we need to sort out, but it's not a show
> stopper for inclusion.
> 
> If you run the single threaded tests and watch for queue depths, is
> there a difference between blk-mq=y/scsi-mq and the stock kernel?
> 
> > I think this means the scsi-mq patches are ready for wider use.
> 
> I would agree. James, I haven't seen any comments from you on this yet.
> I've run various bits of scsi-mq testing as well, and no ill effects
> seen. On top of that, Christophs patches are nicely separated and have
> general benefits even for the non-blk-mq cases. Time to shove them into
> the queue for the next merge window?
> 
> --
> Jens Axboe

We've been testing the hpsa driver extensively with the scsi-mq-wip trees.
I don't have numbers with the latest scsi-mq tree yet, but here are some
performance numbers from scsi-mq-wip.5 through 7.  

scsi-mq slightly underperformed non-scsi-mq when using multiple devices:
* normal          975K IOPS (16 devices each made from 1 drive)
* scsi-mq-wip.5   905K IOPS (16 devices each made from 1 drive)
* scsi-mq-wip.6+  969K IOPS (16 devices... 3 threads per device)

but was much better when using a single device:
* normal          166K IOPS (1 device made from 8 drives, 1 thread)
* normal          266K IOPS (1 device made from 8 drives, 12 threads)
* scsi-mq-wip.5   880K IOPS (1 device made from 8 drives, 12 threads)

* normal          266K IOPS (1 device made from 16 drives, 12 threads)
* scsi-mq-wip.5   973K IOPS (1 device made from 16 drives, 12 threads)
* scsi-mq-wip.6+  979K IOPS (1 device made from 16 drives, 12 threads)


The headline improvement is that one device can reach the same performance 
as multiple devices - no more bottleneck in per-device queue locks limiting 
performance to around 266K IOPS per device.  Even the scsi_debug driver in
fake_rw mode hits that limit.

hpsa is limited to one submission queue, so submissions from multiple CPUs 
still meet inside the driver - SCSI Express will keep them isolated all 
the way.  hpsa supports one completion queue per CPU, so completions are 
already isolated.

The blk-mq bitmap tag allocator is working much better than its 
predecessor, but some combinations of active CPUs and devices still 
result in low queue depths for some devices.

We haven't fully tested cases where the hardware interrupt is handled
on a different CPU than the block layer wants to run its completion
processing per rq_affinity. That was previously scheduled as a softirq,
but is now handled directly in hardirq processing with IPIs.  This
changes the CPU utilization %soft and %hard metrics:
* normal    5% hard, 25% soft
* scsi-mq   30% hard, 0% soft
(with something like 5% usr, 55% sys, 8% iowait idle, 2% idle)


Configuration:
* HP ProLiant DL380p Gen8 with 6 CPU hyperthreading cores (12 logical cores)
* lockless hpsa driver (forthcoming patches with performance 
  improvements such as eliminating locks, plus improved error handling)
* Smart Array P431 RAID controller
* 16 12 Gb/s SAS SSDs
* fio: 4 KiB random reads with options:
  direct=1, ioengine=libaio, norandommap, randrepeat=0,
  iodepth=96 or 1024, numjobs=1 or 12, thread, 
  cpus_allowed=0-11, cpus_allowed_policy=split,
  iodepth_batch=4, iodepth_batch_comple
