date:20170323

Re: [PATCH] lsscsi: Fix truncation of 128-bit wwn

2017-03-23 Thread Vaibhav Jain

Hi Gris,

Thanks for reviewing the patch. I have posted a v2 patch addressing your
review comments. Can you please take a look.

Patch-Link:
https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg59720.html

Gris Ge  writes:

> On Fri, Mar 17, 2017 at 10:08:40AM +0530, Vaibhav Jain wrote:
>> ---
>>  src/lsscsi.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/src/lsscsi.c b/src/lsscsi.c
>> index 974b3f1..f20adcf 100644
>> --- a/src/lsscsi.c
>> +++ b/src/lsscsi.c
>> @@ -210,8 +210,8 @@ struct dev_node_list {
>>  static struct dev_node_list* dev_node_listhead = NULL;
>>  
>>  struct disk_wwn_node_entry {
>> -   char wwn[32];
>> -   char disk_bname[12];
>> +char wwn[35]; /* '0x' + wwn<128-bit> +  */
> Please consider to define a constant instead of using a magic number
> everywhere. Like:
> #define _DISK_WWN_MAX_LEN   35
> /* ^ WWN here is extracted from /dev/disk/by-id/wwn- which is
>  *   created by udev 60-persistent-storage.rules using ID_WWN_WITH_EXTENSION.
>  *   The udev ID_WWN_WITH_EXTENSION is the combination of char wwn[17] and
>  *   char wwn_vendor_extension[17] from struct scsi_id_device.
>  */
>> +char disk_bname[12];
> Please use space instead of \t for indention.
>>  };
>>  
>>  #define DISK_WWN_NODE_LIST_ENTRIES 16
>> @@ -2939,14 +2939,14 @@ one_sdev_entry(const char * dir_name, const char * 
>> devname,
>>  }
>>  if (wd[0]) {
>>  char dev_node[LMAX_NAME] = "";
>> -char wwn_str[34];
>> +char wwn_str[35];
> Please use space instead of \t for indention.
>>  enum dev_type typ;
>>  
>>  typ = (FT_BLOCK == non_sg.ft) ? BLK_DEV : CHR_DEV;
>>  if (get_wwn) {
>>  if ((BLK_DEV == typ) &&
>>  get_disk_wwn(wd, wwn_str, 
>> sizeof(wwn_str)))
>> -printf("%-30s  ", wwn_str);
>> +printf("%-34s  ", wwn_str);
> Please use constant instead of magic number:
> printf("%-*s  ", _DISK_WWN_MAX_LEN - 1, wwn_str);
>>  else
>>  printf("  "
>> "  ");
>> -- 
>> 2.9.3
>> 
>
> -- 
> Gris Ge

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.

Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-23 Thread Lars Ellenberg

On Thu, Mar 23, 2017 at 01:02:22PM -0400, Mike Snitzer wrote:
> On Thu, Mar 23 2017 at 11:54am -0400,
> Lars Ellenberg  wrote:
> 
> > On Thu, Mar 23, 2017 at 10:33:18AM -0400, Christoph Hellwig wrote:
> > > This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> > > supported by the block layer, and switches existing implementations
> > > of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> > > removes incorrect discard_zeroes_data, and also switches WRITE SAME
> > > based zeroing in SCSI to this new method.
> > > 
> > > I've done testing with ATA, SCSI and NVMe setups, but there are
> > > a few things that will need more attention:
> > > 
> > 
> > >  - The DRBD code in this area was very odd,
> > 
> > DRBD wants all replicas to give back identical data.
> > If what comes back after a discard is "undefined",
> > we cannot really use that.
> > 
> > We used to "stack" discard only if our local backend claimed
> > "discard_zeroes_data". We replicate that IO request to the peer
> > as discard, and if the peer cannot do discards itself, or has
> > discard_zeroes_data == 0, the peer will use zeroout instead.
> > 
> > One use-case for this is the device mapper "thin provisioning".
> > At the time I wrote those "odd" hacks, dm thin targets
> > would set discard_zeroes_data=0, NOT change discard granularity,
> > but only actually discard (drop from the tree) whole "chunks",
> > leaving partial start/end chunks in the mapping tree unchanged.
> > 
> > The logic of "only stack discard, if backend discard_zeroes_data"

That is DRBDs logic I just explained above.
And the "backend" (to DRBD) in that sentence would be thin, and not
the "real" hardware below thin, which may not even support discard.

> > would mean that we would not be able to accept and pass down discards
> > to dm-thin targets. But with data on dm-thin, you would really like
> > to do the occasional fstrim.
> 
> Are you sure you aren't thinking of MD raid?

Yes.

> To this day, dm-thin.c has: ti->discard_zeroes_data_unsupported = true

That is exactly what I was saying.

Thin does not claim to zero data on discard.  which is ok, and correct,
because it only punches holes on full chunks (or whatever you call
them), and leaves the rest in the mapping tree as is.

And that behaviour would prevent DRBD from exposing discards if
configured on top of thin. (see above)

But thin *could* easily guarantee zeroing, by simply punching holes
where it can, and zeroing out the not fully-aligned partial start and
end of the range.

Which is what I added as an option between DRBD and whatever is below,
with the use-case of dm-thin in mind.

And that made it possible for DRBD to
 a) expose "discard" to upper layers, even if we would usually only do
if the DRBD Primary sits on top of a device that guarantees discard
zeros data,
 b) still use discards on a secondary, without falling back to zero-out,
which would unexpectedly fully allocate, instead of trim, a thinly
provisioned device-mapper target.

Thanks,

Lars

Re: [PATCH] qedf: fix wrong le16 conversion

2017-03-23 Thread Martin K. Petersen

David Miller  writes:

>> Dave: Since you queued the firmware patch, mind taking this fix through
>> your tree?
>
> Ok, applied to net-next, thanks.

Great, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: iscsi (?): Unknown VPD Code: 0xc7

2017-03-23 Thread Martin K. Petersen

Harald Dunkel  writes:

> Mar 23 19:06:30 nasl003b kernel: [38396.624678] Unknown VPD Code: 0xc7
>
> What is it trying to tell me?

That is an IBM-specific VPD page containing device-unique configuration
data. I guess it is a prerequisite for installing AIX.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH] scsi: osd_uld: remove an unneeded NULL check

2017-03-23 Thread Boaz Harrosh

On 03/23/2017 12:41 PM, Dan Carpenter wrote:
> We don't call the remove() function unless probe() succeeds so "oud"
> can't be NULL here.  Plus, if it were NULL, we dereference it on the
> next line so it would crash anyway.
> 
> Signed-off-by: Dan Carpenter 
> 

Thanks sure!
ACK-by Boaz Harrosh 

> diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c
> index 4101c3178411..8b9941a5687a 100644
> --- a/drivers/scsi/osd/osd_uld.c
> +++ b/drivers/scsi/osd/osd_uld.c
> @@ -507,10 +507,9 @@ static int osd_remove(struct device *dev)
>   struct scsi_device *scsi_device = to_scsi_device(dev);
>   struct osd_uld_device *oud = dev_get_drvdata(dev);
>  
> - if (!oud || (oud->od.scsi_device != scsi_device)) {
> - OSD_ERR("Half cooked osd-device %p,%p || %p!=%p",
> - dev, oud, oud ? oud->od.scsi_device : NULL,
> - scsi_device);
> + if (oud->od.scsi_device != scsi_device) {
> + OSD_ERR("Half cooked osd-device %p, || %p!=%p",
> + dev, oud->od.scsi_device, scsi_device);
>   }
>  
>   cdev_device_del(&oud->cdev, &oud->class_dev);
>

iscsi (?): Unknown VPD Code: 0xc7

2017-03-23 Thread Harald Dunkel

Hi folks,

I am trying to use LIO on amd64 to provide block devices to some
AIX 6.1 and 7.1 hosts. It works pretty well, except that I cannot
create rootvg wpars on AIX 7.1. (rootvg wpars are containers
running on their own volume group.)

The error message on AIX is

bash-4.3# mkwpar -O -D rootvg=yes devname=hdisk3 -n sample
**
ERROR
mkwpar: 0960-587 hdisk3 has un-supported subclass type.

**


Kernel is 4.9.13 (Debian backports). It shows several (maybe unrelated)
lines in kernel.log

:
Mar 23 08:46:05 nasl003b kernel: [ 1172.226643] Unknown VPD Code: 0xc7
Mar 23 08:46:05 nasl003b kernel: [ 1172.279368] Unknown VPD Code: 0xc7
Mar 23 08:46:06 nasl003b kernel: [ 1172.331243] Unknown VPD Code: 0xc7
Mar 23 08:46:06 nasl003b kernel: [ 1172.380074] Unknown VPD Code: 0xc7
Mar 23 08:47:02 nasl003b kernel: [ 1229.106384] Unknown VPD Code: 0xc7
Mar 23 19:06:30 nasl003b kernel: [38396.514881] Unknown VPD Code: 0xc7
Mar 23 19:06:30 nasl003b kernel: [38396.576499] Unknown VPD Code: 0xc7
Mar 23 19:06:30 nasl003b kernel: [38396.624678] Unknown VPD Code: 0xc7
:

What is it trying to tell me?


Every helpful hint is highly appreciated
Harri

Re: [PATCH] qedf: fix wrong le16 conversion

2017-03-23 Thread David Miller

From: "Martin K. Petersen" 
Date: Thu, 23 Mar 2017 10:19:03 -0400

> Arnd Bergmann  writes:
> 
>> gcc points out that we are converting a 16-bit integer into a 32-bit
>> little-endian type and assigning that to 16-bit little-endian
>> will end up with a zero:
>>
>> drivers/scsi/qedf/drv_fcoe_fw_funcs.c: In function 
>> 'init_initiator_rw_fcoe_task':
>> include/uapi/linux/byteorder/big_endian.h:32:26: error: large integer 
>> implicitly truncated to unsigned type [-Werror=overflow]
>>   t_st_ctx->read_write.rx_id = cpu_to_le32(FCOE_RX_ID);
>>
>> The correct solution appears to be to just use a 16-bit byte swap instead.
>>
>> Fixes: be086e7c53f1 ("qed*: Utilize Firmware 8.15.3.0")
>> Signed-off-by: Arnd Bergmann 
>> ---
>>  drivers/scsi/qedf/drv_fcoe_fw_funcs.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Dave: Since you queued the firmware patch, mind taking this fix through
> your tree?

Ok, applied to net-next, thanks.

Re: [PATCH v2 03/10] be2iscsi: Replace spin_unlock_bh with spin_lock

2017-03-23 Thread Tomas Henzl

On 16.3.2017 04:24, Jitendra Bhivare wrote:
> spin_unlock_bh back_lock is used in beiscsi_eh_device_reset instead of
> spin_lock.
>
> Signed-off-by: Jitendra Bhivare 

Reviewed-by: Tomas Henzl 
tomash

Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-23 Thread Mike Snitzer

On Thu, Mar 23 2017 at 11:54am -0400,
Lars Ellenberg  wrote:

> On Thu, Mar 23, 2017 at 10:33:18AM -0400, Christoph Hellwig wrote:
> > This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> > supported by the block layer, and switches existing implementations
> > of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> > removes incorrect discard_zeroes_data, and also switches WRITE SAME
> > based zeroing in SCSI to this new method.
> > 
> > I've done testing with ATA, SCSI and NVMe setups, but there are
> > a few things that will need more attention:
> > 
> 
> >  - The DRBD code in this area was very odd,
> 
> DRBD wants all replicas to give back identical data.
> If what comes back after a discard is "undefined",
> we cannot really use that.
> 
> We used to "stack" discard only if our local backend claimed
> "discard_zeroes_data". We replicate that IO request to the peer
> as discard, and if the peer cannot do discards itself, or has
> discard_zeroes_data == 0, the peer will use zeroout instead.
> 
> One use-case for this is the device mapper "thin provisioning".
> At the time I wrote those "odd" hacks, dm thin targets
> would set discard_zeroes_data=0, NOT change discard granularity,
> but only actually discard (drop from the tree) whole "chunks",
> leaving partial start/end chunks in the mapping tree unchanged.
> 
> The logic of "only stack discard, if backend discard_zeroes_data"
> would mean that we would not be able to accept and pass down discards
> to dm-thin targets. But with data on dm-thin, you would really like
> to do the occasional fstrim.

Are you sure you aren't thinking of MD raid?  E.g. raid5's
"devices_handle_discard_safely":
parm:   devices_handle_discard_safely:Set to Y if all devices in each 
array reliably return zeroes on reads from discarded regions (bool)

I don't recall DM thinp's discard support ever having a requirement for
discard_zeroes_data.

In fact, see header from commit b60ab990ccdf3 ("dm thin: do not expose
non-zero discard limits if discards disabled"):

Also, always set discard_zeroes_data_unsupported in targets because they
should never advertise the 'discard_zeroes_data' capability (even if the
pool's data device supports it).

To this day, dm-thin.c has: ti->discard_zeroes_data_unsupported = true

Re: [PATCH v2 01/10] be2iscsi: Check tag in beiscsi_mccq_compl_wait

2017-03-23 Thread Tomas Henzl

On 16.3.2017 04:24, Jitendra Bhivare wrote:
> scsi host12: BS_1377 : mgmt_invalidate_connection Failed for cid=256
> BUG: unable to handle kernel NULL pointer dereference at 0008
> IP: [] __list_add+0xf/0xc0
> PGD 0
> Oops:  [#1] SMP
> Modules linked in:
> ...
> CPU: 9 PID: 1542 Comm: iscsid Tainted: G    T 
> 3.10.0-514.el7.x86_64 #1
> Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/12/2016
> task: 88076f310fb0 ti: 88076bba8000 task.ti: 88076bba8000
> RIP: 0010:[]  [] __list_add+0xf/0xc0
> RSP: 0018:88076bbab8e8  EFLAGS: 00010046
> RAX: 0246 RBX: 88076bbab990 RCX: 
> RDX:  RSI: 880468badf58 RDI: 88076bbab990
> RBP: 88076bbab900 R08: 0246 R09: 20de
> R10:  R11: 88076bbab5be R12: 
> R13: 880468badf58 R14: 0001adb0 R15: 88076f310fb0
> FS:  7f377124a880() GS:88046fa4() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0008 CR3: 000771318000 CR4: 001407e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Stack:
> 88076bbab990 880468badf50 0001 88076bbab938
> 810b128b 0246 cf9b7040 880468bac7a0
>  880468bac7a0 88076bbab9d0 a05a6ea3
>
> Call Trace:
> [] prepare_to_wait+0x7b/0x90
> [] beiscsi_mccq_compl_wait+0x153/0x330 [be2iscsi]
> [] ? wake_up_atomic_t+0x30/0x30
> [] beiscsi_ep_disconnect+0x91/0x2d0 [be2iscsi]
> [] iscsi_if_ep_disconnect.isra.14+0x5a/0x70 
> [scsi_transport_iscsi]
> [] iscsi_if_recv_msg+0x113b/0x14a0 [scsi_transport_iscsi]
> [] ? __kmalloc_node_track_caller+0x58/0x290
> [] iscsi_if_rx+0x8e/0x1f0 [scsi_transport_iscsi]
> [] netlink_unicast+0xed/0x1b0
> [] netlink_sendmsg+0x31e/0x690
> [] ? netlink_rcv_wake+0x44/0x60
> [] ? netlink_recvmsg+0x1e3/0x450
>
> beiscsi_mccq_compl_wait gets called even when MCC tag allocation failed
> for mgmt_invalidate_connection.
> mcc_wait is not initialized for tag 0 so causes crash in prepare_to_wait.
>
> Signed-off-by: Jitendra Bhivare 

Reviewed-by: Tomas Henzl 
tomash

Re: [PATCH v2 02/10] be2iscsi: Fix closing of connection

2017-03-23 Thread Tomas Henzl

On 16.3.2017 04:24, Jitendra Bhivare wrote:
> CID needs to be freed even when invalidate or upload connection fails.
> Attempt to close connection 3 times before freeing CID.
>
> Set cleanup_type to INVALIDATE instead of force TCP_RST.
> This unnecessarily is terminating connection with reset instead of
> gracefully closing it.
>
> Set save_cfg to 0 - session not to be saved on flash.
>
> Add delay and process CQ before uploading connection.
>
> Signed-off-by: Jitendra Bhivare 
> ---
>  drivers/scsi/be2iscsi/be.h   |   1 -
>  drivers/scsi/be2iscsi/be_cmds.h  |  63 +--
>  drivers/scsi/be2iscsi/be_iscsi.c |  98 +-
>  drivers/scsi/be2iscsi/be_mgmt.c  | 127 
> ---
>  drivers/scsi/be2iscsi/be_mgmt.h  |  30 ++---
>  5 files changed, 159 insertions(+), 160 deletions(-)
>
> diff --git a/drivers/scsi/be2iscsi/be.h b/drivers/scsi/be2iscsi/be.h
> index ca9440f..4dd8de4 100644
> --- a/drivers/scsi/be2iscsi/be.h
> +++ b/drivers/scsi/be2iscsi/be.h
> @@ -154,7 +154,6 @@ struct be_ctrl_info {
>  #define PAGE_SHIFT_4K 12
>  #define PAGE_SIZE_4K (1 << PAGE_SHIFT_4K)
>  #define mcc_timeout  12 /* 12s timeout */
> -#define BEISCSI_LOGOUT_SYNC_DELAY250
>  
>  /* Returns number of pages spanned by the data starting at the given addr */
>  #define PAGES_4K_SPANNED(_address, size) \
> diff --git a/drivers/scsi/be2iscsi/be_cmds.h b/drivers/scsi/be2iscsi/be_cmds.h
> index 1d40e83..88fe731 100644
> --- a/drivers/scsi/be2iscsi/be_cmds.h
> +++ b/drivers/scsi/be2iscsi/be_cmds.h
> @@ -1145,24 +1145,49 @@ struct tcp_connect_and_offload_out {
>  #define DB_DEF_PDU_EVENT_SHIFT   15
>  #define DB_DEF_PDU_CQPROC_SHIFT  16
>  
> -struct dmsg_cqe {
> - u32 dw[4];
> +struct be_invalidate_connection_params_in {
> + struct be_cmd_req_hdr hdr;
> + u32 session_handle;
> + u16 cid;
> + u16 unused;
> +#define BE_CLEANUP_TYPE_INVALIDATE   0x8001
> +#define BE_CLEANUP_TYPE_ISSUE_TCP_RST0x8002
> + u16 cleanup_type;
> + u16 save_cfg;
> +} __packed;
> +
> +struct be_invalidate_connection_params_out {
> + u32 session_handle;
> + u16 cid;
> + u16 unused;
>  } __packed;
>  
> -struct tcp_upload_params_in {
> +union be_invalidate_connection_params {
> + struct be_invalidate_connection_params_in req;
> + struct be_invalidate_connection_params_out resp;
> +} __packed;
> +
> +struct be_tcp_upload_params_in {
>   struct be_cmd_req_hdr hdr;
>   u16 id;
> +#define BE_UPLOAD_TYPE_GRACEFUL  1
> +/* abortive upload with reset */
> +#define BE_UPLOAD_TYPE_ABORT_RESET   2
> +/* abortive upload without reset */
> +#define BE_UPLOAD_TYPE_ABORT 3
> +/* abortive upload with reset, sequence number by driver */
> +#define BE_UPLOAD_TYPE_ABORT_WITH_SEQ4
>   u16 upload_type;
>   u32 reset_seq;
>  } __packed;
>  
> -struct tcp_upload_params_out {
> +struct be_tcp_upload_params_out {
>   u32 dw[32];
>  } __packed;
>  
> -union tcp_upload_params {
> - struct tcp_upload_params_in request;
> - struct tcp_upload_params_out response;
> +union be_tcp_upload_params {
> + struct be_tcp_upload_params_in request;
> + struct be_tcp_upload_params_out response;
>  } __packed;
>  
>  struct be_ulp_fw_cfg {
> @@ -1243,10 +1268,7 @@ struct be_cmd_get_port_name {
>  #define OPCODE_COMMON_WRITE_FLASH96
>  #define OPCODE_COMMON_READ_FLASH 97
>  
> -/* --- CMD_ISCSI_INVALIDATE_CONNECTION_TYPE --- */
>  #define CMD_ISCSI_COMMAND_INVALIDATE 1
> -#define CMD_ISCSI_CONNECTION_INVALIDATE  0x8001
> -#define CMD_ISCSI_CONNECTION_ISSUE_TCP_RST   0x8002
>  
>  #define INI_WR_CMD   1   /* Initiator write command */
>  #define INI_TMF_CMD  2   /* Initiator TMF command */
> @@ -1269,27 +1291,6 @@ struct be_cmd_get_port_name {
>*  preparedby
>* driver should not be touched
>*/
> -/* --- CMD_CHUTE_TYPE --- */
> -#define CMD_CONNECTION_CHUTE_0   1
> -#define CMD_CONNECTION_CHUTE_1   2
> -#define CMD_CONNECTION_CHUTE_2   3
> -
> -#define EQ_MAJOR_CODE_COMPLETION 0
> -
> -#define CMD_ISCSI_SESSION_DEL_CFG_FROM_FLASH 0
> -#define CMD_ISCSI_SESSION_SAVE_CFG_ON_FLASH 1
> -
> -/* --- CONNECTION_UPLOAD_PARAMS --- */
> -/* These parameters are used to define the type of upload desired.  */
> -#define CONNECTION_UPLOAD_GRACEFUL  1/* Graceful upload  */
> -#define CONNECTION_UPLOAD_ABORT_RESET   2/* Abortive upload with
> -  * reset
> -  */
> -#define CONNECTION_UPLOAD_ABORT  3   /* Abortive upload 
> without
> -  * reset
> -

Re: [PATCH 1/6] ipr: Fix missed EH wakeup

2017-03-23 Thread Martin K. Petersen

Brian King  writes:

> Following a command abort or device reset, ipr's EH handlers wait
> for the commands getting aborted to get sent back from the adapter
> prior to returning from the EH handler. This fixes up some cases
> where the completion handler was not getting called, which would
> have resulted in the EH thread waiting until it timed out, greatly
> extending EH time.

Applied to 4.12/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCHv4 0/4] tcmu: bug fix and dynamic growing data area support

2017-03-23 Thread Ilias Tsitsimpis

Hi Xiubo,

On Tue, Mar 21, 2017 at 04:36PM, lixi...@cmss.chinamobile.com wrote:
> [...]
>   tcmu: Fix possible overwrite of t_data_sg's last iov[]
>   tcmu: Fix wrongly calculating of the base_command_size

I tested these two patches, which try to fix the broken support for
BIDI commands in target/user.

Both look good to me, but unfortunately, there is also another
regression which got introduced with the use of the data_bitmap. More
specifically, in case of BIDI commands, the data bitmap records both the
Data-Out and the Data-In buffer, and so when gathering the data in the
tcmu_handle_completion() function, care should be taken in order to
discard the Data-Out buffer before gathering the Data-In buffer.
Something like this should do the trick:

diff --git a/drivers/target/target_core_user.c 
b/drivers/target/target_core_user.c
index 1108bf5..7075161 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -610,10 +610,22 @@ static void tcmu_handle_completion(struct tcmu_cmd *cmd, 
struct tcmu_cmd_entry *
   se_cmd->scsi_sense_length);
free_data_area(udev, cmd);
} else if (se_cmd->se_cmd_flags & SCF_BIDI) {
+   struct scatterlist *sg;
+   int n, block, sg_remaining = 0;
DECLARE_BITMAP(bitmap, DATA_BLOCK_BITS);

-   /* Get Data-In buffer before clean up */
bitmap_copy(bitmap, cmd->data_bitmap, DATA_BLOCK_BITS);
+
+   /* Discard Data-Out buffer */
+   for_each_sg(se_cmd->t_data_sg, sg, se_cmd->t_data_nents, n) {
+   sg_remaining += sg->length;
+   while (sg_remaining > 0) {
+   block = find_first_bit(bitmap, DATA_BLOCK_BITS);
+   clear_bit(block, bitmap);
+   sg_remaining -= DATA_BLOCK_SIZE;
+   }
+   }
+
+   /* Get Data-In buffer before clean up */
gather_data_area(udev, bitmap,
se_cmd->t_bidi_data_sg, se_cmd->t_bidi_data_nents);
free_data_area(udev, cmd);

With your patches, and the above diff, support for BIDI commands in the
target/user seems to be working again.

Could you please validate the above snippet, and update your patches to
include it?

-- 
Ilias

Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-23 Thread Lars Ellenberg

On Thu, Mar 23, 2017 at 10:33:18AM -0400, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> I've done testing with ATA, SCSI and NVMe setups, but there are
> a few things that will need more attention:
> 

>  - The DRBD code in this area was very odd,

DRBD wants all replicas to give back identical data.
If what comes back after a discard is "undefined",
we cannot really use that.

We used to "stack" discard only if our local backend claimed
"discard_zeroes_data". We replicate that IO request to the peer
as discard, and if the peer cannot do discards itself, or has
discard_zeroes_data == 0, the peer will use zeroout instead.

One use-case for this is the device mapper "thin provisioning".
At the time I wrote those "odd" hacks, dm thin targets
would set discard_zeroes_data=0, NOT change discard granularity,
but only actually discard (drop from the tree) whole "chunks",
leaving partial start/end chunks in the mapping tree unchanged.

The logic of "only stack discard, if backend discard_zeroes_data"
would mean that we would not be able to accept and pass down discards
to dm-thin targets. But with data on dm-thin, you would really like
to do the occasional fstrim.

Also, IO backends on the peers do not have to have the same
characteristics.  You could have the DRBD Primary on some SSD,
and the Secondary on some thin-pool LV,
scheduling thin snapthots in intervals or on demand.

With the logic of "use zero-out instead", fstrim would cause it to
fully allocate what was supposed to be thinly provisioned :-(

So what I did there was optionally tell DRBD that
"discard_zeroes_data == 0" on that peer would actually mean
"discard_zeroes_data == 1,
 IF you zero-out the partial chunks of this granularity yourself".

And implemented this "discard aligned chunks of that granularity,
and zero-out partial start/end chunks, if any".

And then claim to upper layers that, yes, discard_zeroes_data=1,
in that case, if so configured, even if our backend (dm-thin)
would say discard_zeroes_data=0.

Does that make sense?  Can we still do that?  Has something like that
been done in block core or device mapper meanwhile?

> and will need an audit from the maintainers.

Will need to make some time for review and testing.

Thanks,

Lars

Re: [PATCH 1/6] ipr: Fix missed EH wakeup

2017-03-23 Thread wenxiong


On 2017-03-15 16:58, Brian King wrote:

I have reviewed this serial of patches and tested them on IBM systems 
successfully


Thanks for your help!
Wendy

Following a command abort or device reset, ipr's EH handlers wait
for the commands getting aborted to get sent back from the adapter
prior to returning from the EH handler. This fixes up some cases
where the completion handler was not getting called, which would
have resulted in the EH thread waiting until it timed out, greatly
extending EH time.

Signed-off-by: Brian King 
---

 drivers/scsi/ipr.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff -puN drivers/scsi/ipr.c~ipr_fix_missed_eh_wakeup 
drivers/scsi/ipr.c

---
linux-2.6.git/drivers/scsi/ipr.c~ipr_fix_missed_eh_wakeup   2017-03-13
14:52:17.974545318 -0500
+++ linux-2.6.git-bjking1/drivers/scsi/ipr.c	2017-03-13 
17:03:13.635568644 -0500

@@ -836,8 +836,10 @@ static void ipr_sata_eh_done(struct ipr_

qc->err_mask |= AC_ERR_OTHER;
sata_port->ioasa.status |= ATA_BUSY;
-   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
ata_qc_complete(qc);
+   if (ipr_cmd->eh_comp)
+   complete(ipr_cmd->eh_comp);
+   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
 }

 /**
@@ -5947,8 +5949,10 @@ static void ipr_erp_done(struct ipr_cmnd
res->in_erp = 0;
}
scsi_dma_unmap(ipr_cmd->scsi_cmd);
-   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
scsi_cmd->scsi_done(scsi_cmd);
+   if (ipr_cmd->eh_comp)
+   complete(ipr_cmd->eh_comp);
+   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
 }

 /**
@@ -6338,8 +6342,10 @@ static void ipr_erp_start(struct ipr_ioa
}

scsi_dma_unmap(ipr_cmd->scsi_cmd);
-   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
scsi_cmd->scsi_done(scsi_cmd);
+   if (ipr_cmd->eh_comp)
+   complete(ipr_cmd->eh_comp);
+   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
 }

 /**
@@ -6365,8 +6371,10 @@ static void ipr_scsi_done(struct ipr_cmn
scsi_dma_unmap(scsi_cmd);

spin_lock_irqsave(ipr_cmd->hrrq->lock, lock_flags);
-   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
scsi_cmd->scsi_done(scsi_cmd);
+   if (ipr_cmd->eh_comp)
+   complete(ipr_cmd->eh_comp);
+   list_add_tail(&ipr_cmd->queue, &ipr_cmd->hrrq->hrrq_free_q);
spin_unlock_irqrestore(ipr_cmd->hrrq->lock, lock_flags);
} else {
spin_lock_irqsave(ioa_cfg->host->host_lock, lock_flags);
_

sg: random memory corruptions

2017-03-23 Thread Dmitry Vyukov

Hello,

The following program causes random assorted memory corruptions:

https://gist.githubusercontent.com/dvyukov/da3463af2d1ff8c7d3624891b5d7427f/raw/09cf0f4af529f4506f9e0a9fa6bdb066a8777b9d/gistfile1.txt

It does some ioctl's on /dev/sg0.

general protection fault:  [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 2843 Comm: rsyslogd Not tainted 4.11.0-rc3+ #365
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 880062754300 task.stack: 880062fc8000
RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline]
RIP: 0010:radix_tree_load_root lib/radix-tree.c:602 [inline]
RIP: 0010:__radix_tree_lookup+0x12c/0x5e0 lib/radix-tree.c:1040
RSP: :880062fced90 EFLAGS: 00010202
RAX: 00f37b916d5e RBX: 079bdc8b6ae8 RCX: 880062fcefa8
RDX:  RSI: 0001622819596228 RDI: 079bdc8b6ae8
RBP: 880062fcef78 R08: ed000da135c2 R09: ed000da135c2
R10: 0001 R11: ed000da135c1 R12: 880062fcefa8
R13: dc00 R14: 079bdc8b6ae8 R15: 0001622819596228
FS:  7f773eae2700() GS:88006d08() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f773eae1e30 CR3: 64ac CR4: 001406e0
Call Trace:
 radix_tree_lookup_slot+0x78/0xe0 lib/radix-tree.c:1079
 find_get_entry+0x186/0x990 mm/filemap.c:1190
 pagecache_get_page+0x116/0xb60 mm/filemap.c:1298
 find_get_page include/linux/pagemap.h:258 [inline]
 lookup_swap_cache+0x8d/0x110 mm/swap_state.c:296
 do_swap_page+0x278/0x2110 mm/memory.c:2702
 handle_pte_fault mm/memory.c:3727 [inline]
 __handle_mm_fault+0x1747/0x3e70 mm/memory.c:3841
 handle_mm_fault+0x141/0x4f0 mm/memory.c:3878
 __do_page_fault+0x503/0xb50 arch/x86/mm/fault.c:1397
 trace_do_page_fault+0x145/0x720 arch/x86/mm/fault.c:1490
 do_async_page_fault+0x72/0xc0 arch/x86/kernel/kvm.c:264
 async_page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1014
RIP: 0033:0x7f77411261fd
RSP: 002b:7f773eae1e30 EFLAGS: 00010293
RAX: 0024 RBX: 020944d0 RCX: 7f77411261fd
RDX: 0fff RSI: 7f77405225a0 RDI: 0004
RBP:  R08: 0207f160 R09: 0401
R10: 0001 R11: 0293 R12: 0065e420
R13: 7f773eae29c0 R14: 7f774176b040 R15: 0003
Code: ff 42 c6 04 28 00 48 8d 85 d8 fe ff ff 48 c1 e8 03 42 c6 04 28
00 48 8b 85 48 fe ff ff c6 00 00 48 8b 85 40 fe ff ff 48 c1 e8 03 <42>
80 3c 28 00 0f 85 07 04 00 00 48 8b 85 38 fe ff ff 4c 8b 60
RIP: __read_once_size include/linux/compiler.h:254 [inline] RSP:
880062fced90
RIP: radix_tree_load_root lib/radix-tree.c:602 [inline] RSP: 880062fced90
RIP: __radix_tree_lookup+0x12c/0x5e0 lib/radix-tree.c:1040 RSP: 880062fced90
---[ end trace 53d928cd2f7a08d4 ]---

BUG: unable to handle kernel paging request at 015281a4
IP: qlist_move_cache+0x74/0x100 mm/kasan/quarantine.c:274
PGD 0
Oops:  [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 29 Comm: kworker/u8:1 Not tainted 4.11.0-rc3+ #365
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: netns cleanup_net
task: 88006c096800 task.stack: 88006c098000
RIP: 0010:qlist_move_cache+0x74/0x100 mm/kasan/quarantine.c:274
RSP: 0018:88006c09f368 EFLAGS: 00010002
RAX: 015281a4 RBX: 77ff8000 RCX: 015281a4
RDX: 8800696dd200 RSI: 88006c09f388 RDI: 865cce58
RBP: 88006c09f378 R08: 88006c019340 R09: 8000
R10: 015281a4 R11: ea00 R12: ea00016892ef
R13: 84fa9740 R14: 865d1760 R15: 8800696dd200
FS:  () GS:88006d00() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 015281a4 CR3: 69898000 CR4: 001406f0
Call Trace:
 quarantine_remove_cache+0x79/0xf0 mm/kasan/quarantine.c:317
 kasan_cache_shutdown+0x9/0x10 mm/kasan/kasan.c:451
 shutdown_cache mm/slab_common.c:532 [inline]
 kmem_cache_destroy+0x52/0x120 mm/slab_common.c:830
 tipc_server_stop+0x13f/0x190 net/tipc/server.c:636
 tipc_topsrv_stop+0x200/0x360 net/tipc/subscr.c:397
 tipc_exit_net+0x15/0x40 net/tipc/core.c:96
 ops_exit_list.isra.4+0xae/0x150 net/core/net_namespace.c:141
 cleanup_net+0x5c7/0xb60 net/core/net_namespace.c:463
 process_one_work+0xb20/0x1b40 kernel/workqueue.c:2097
 worker_thread+0x1b4/0x1340 kernel/workqueue.c:2231
 kthread+0x359/0x420 kernel/kthread.c:229
 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
Code: 00 8b 49 14 0f 84 87 00 00 00 4c 8b 47 08 49 89 00 48 89 47 08
48 c7 00 00 00 00 00 4c 89 d0 48 01 4f 10 4d 85 d2 74 64 48 89 c1 <4c>
8b 10 4c 01 c9 72 6d 49 89 d8 4c 01 c1 48 c1 e9 0c 4c 8d 04
RIP: qlist_move_cache+0x74/0x100 mm/kasan/quarantine.c:274 RSP: 88006c09f368
CR2: 015281a4
---[ end trace 9db83f7c295b4f05 ]---

BUG: unable to handle kernel paging request at eba5308001e0
IP: virt_to_head_page include/linux/mm.h:570 [inline]
IP: qlink_t

Re: support ranges TRIM for libata

2017-03-23 Thread Martin K. Petersen

Christoph Hellwig  writes:

> Meh, I remember why I gave up on it - to support queued trim passthrough
> we'd need to implement ATA 32 for the auxiliary fields and thus support
> 32-bit CDBs.  I don't really want to go there..

I wish we could stick with SATL. However, I have attempted this a few
times over the years and I am with Christoph here. Due to the
discrepancies between ACS and SBC it's a twisted mess. And the iterative
approach has not worked well for HBA SATLs either. Quite the contrary.

So I think sticking the DSM TRIM payload onto a SCSI command is the path
of least resistance. And then NAK'ing attempts to issue these commands
through sg.

I'll take closer look at the entire series tomorrow or Monday. I want to
test multiple ranges on my "Little Shop of SSD Horrors" when I get back
home from LSF/MM.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: support ranges TRIM for libata

2017-03-23 Thread Christoph Hellwig

On Thu, Mar 23, 2017 at 03:43:30PM +0100, Christoph Hellwig wrote:
> I tried this earlier before giving up on it because it looked to ugly.
> But I can complete that version of it and post it for people to compare.

Meh, I remember why I gave up on it - to support queued trim passthrough
we'd need to implement ATA 32 for the auxiliary fields and thus support
32-bit CDBs.  I don't really want to go there..

Re: support ranges TRIM for libata

2017-03-23 Thread Christoph Hellwig

On Thu, Mar 23, 2017 at 11:04:58AM -0400, Tejun Heo wrote:
> I kinda like the idea of sticking with satl as that's how libata has
> been doing most things even if the implementation is uglier.  It'd be
> great to find out whether the ugliness would be acceptable or too
> much.

The SATL way is simply broken.  I'll just leave the code corrupting
user data and 5 times slower than it could be then.

Re: [PATCH] Fix PT2PT PRLI reject.

2017-03-23 Thread Martin K. Petersen

Dick Kennedy  writes:

Dick,

> lpfc cannot establish connection with targets that send PRLI in P2P
> configurations.
>
> If lpfc rejects a PRLI that is sent from a target the target will not
> resend and will reject the PRLI send from the initiator.

Applied to 4.11/scsi-fixes (by hand, due to being mangled).

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH 00/23] hisi_sas: error handling and other misc fixes and improvements

2017-03-23 Thread Martin K. Petersen

John Garry  writes:

John,

> This patchset introduces a range of error handling
> and other misc improvements for the HiSilicon SAS
> controller, including:
> - controller reset function
> - softreset for SATA error handling
> - fixes for slot free'ing
> - v2 hw error handling improvements
> - and other misc, more minor stuff

Applied to 4.12/scsi-queue. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH 06/23] dm-kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-03-23 Thread Mike Snitzer

On Thu, Mar 23 2017 at 10:56am -0400,
Christoph Hellwig  wrote:

> On Thu, Mar 23, 2017 at 10:55:22AM -0400, Mike Snitzer wrote:
> > See commit 70d6c400a ("dm kcopyd: add WRITE SAME support to dm_kcopyd_zero")
> > drivers/md/dm-io.c:do_region() adjusts the WRITE SAME payload to be a
> > single page.
> > 
> > So you'd want to tweak dm-io.c accordingly for WRITE ZEROES (presummably
> > no payload?)
> 
> Yeah, I'll look into it.

Ah, your previous 05/23 patch re-uses the discard branch in
dm-io.c:do_region.  Looks correct.  So you should be good.

Not sure why you've split out the dm-kcopyd patch, likely best to just
fold it into the previous dm support patch.

I'll try your code out on my testbed, probably tomorrow, but in general
the DM code looks solid (thanks for doing it!).  But I'll take a closer
look and will report back with my Reviewed-by if all looks (and tests
out) good.

> Any good test case for verifying this code while I've got your
> attention?

DM thinp is the only consumer of dm_kcopyd_zero().  DM thinp
conservatively defaults to zeroing (but we recommend
'skip_block_zeroing' for most setups).  So anyway, just using a DM thin
device with partial thin blocksize IO (without 'skip_block_zeroing')
should get you coverage.

The device-mapper-test-suite thin tests have thinp w/ zeroing coverage
so I'll be sure to run those.

Re: support ranges TRIM for libata

2017-03-23 Thread Tejun Heo

Hello, Christoph.

On Thu, Mar 23, 2017 at 03:43:30PM +0100, Christoph Hellwig wrote:
> > That's up to you ... from the point of view of code documenting itself,
> > forming the ATA_16 TRIM in sd and not doing any satl transformation is
> > easier for others to follow, but if it's going to cause more code, I'm
> > only marginal on the advantages of easier to follow code.
> 
> I tried this earlier before giving up on it because it looked to ugly.
> But I can complete that version of it and post it for people to compare.

I kinda like the idea of sticking with satl as that's how libata has
been doing most things even if the implementation is uglier.  It'd be
great to find out whether the ugliness would be acceptable or too
much.

Thanks.

-- 
tejun

[PATCH] scsi: advansys: fix uninitialized data access

2017-03-23 Thread Arnd Bergmann

gcc-7.0.1 now warns about a previously unnoticed access of
uninitialized struct members:

drivers/scsi/advansys.c: In function 'AscMsgOutSDTR':
drivers/scsi/advansys.c:3860:26: error: '*((void *)&sdtr_buf+5)' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
 ((ushort)s_buffer[i + 1] << 8) | s_buffer[i]);
  ^
drivers/scsi/advansys.c:3860:26: error: '*((void *)&sdtr_buf+7)' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
drivers/scsi/advansys.c:3860:26: error: '*((void *)&sdtr_buf+5)' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
drivers/scsi/advansys.c:3860:26: error: '*((void *)&sdtr_buf+7)' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

The code has existed in this exact form at least since v2.6.12, and the
warning seems correct. This uses named initializers to ensure we initialize
all members of the structure.

Signed-off-by: Arnd Bergmann 
---
 drivers/scsi/advansys.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
index 81dd0927246b..24e57e770432 100644
--- a/drivers/scsi/advansys.c
+++ b/drivers/scsi/advansys.c
@@ -6291,18 +6291,17 @@ static uchar AscGetSynPeriodIndex(ASC_DVC_VAR *asc_dvc, 
uchar syn_time)
 static uchar
 AscMsgOutSDTR(ASC_DVC_VAR *asc_dvc, uchar sdtr_period, uchar sdtr_offset)
 {
-   EXT_MSG sdtr_buf;
-   uchar sdtr_period_index;
-   PortAddr iop_base;
-
-   iop_base = asc_dvc->iop_base;
-   sdtr_buf.msg_type = EXTENDED_MESSAGE;
-   sdtr_buf.msg_len = MS_SDTR_LEN;
-   sdtr_buf.msg_req = EXTENDED_SDTR;
-   sdtr_buf.xfer_period = sdtr_period;
+   PortAddr iop_base = asc_dvc->iop_base;
+   uchar sdtr_period_index = AscGetSynPeriodIndex(asc_dvc, sdtr_period);
+   EXT_MSG sdtr_buf = {
+   .msg_type = EXTENDED_MESSAGE,
+   .msg_len = MS_SDTR_LEN,
+   .msg_req = EXTENDED_SDTR,
+   .xfer_period = sdtr_period,
+   .req_ack_offset = sdtr_offset,
+   };
sdtr_offset &= ASC_SYN_MAX_OFFSET;
-   sdtr_buf.req_ack_offset = sdtr_offset;
-   sdtr_period_index = AscGetSynPeriodIndex(asc_dvc, sdtr_period);
+
if (sdtr_period_index <= asc_dvc->max_sdtr_index) {
AscMemWordCopyPtrToLram(iop_base, ASCV_MSGOUT_BEG,
(uchar *)&sdtr_buf,
-- 
2.9.0

Re: [PATCH 06/23] dm-kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

On Thu, Mar 23, 2017 at 10:55:22AM -0400, Mike Snitzer wrote:
> See commit 70d6c400a ("dm kcopyd: add WRITE SAME support to dm_kcopyd_zero")
> drivers/md/dm-io.c:do_region() adjusts the WRITE SAME payload to be a
> single page.
> 
> So you'd want to tweak dm-io.c accordingly for WRITE ZEROES (presummably
> no payload?)

Yeah, I'll look into it.  Any good test case for verifying this code
while I've got your attention?

[PATCH] scsi: lpfc: fix building without debugfs support

2017-03-23 Thread Arnd Bergmann

On a randconfig build without CONFIG_SCSI_LPFC_DEBUG_FS, I ran into
multiple compile failures:

drivers/scsi/lpfc/lpfc_debugfs.h: In function 'lpfc_debug_dump_wq':
drivers/scsi/lpfc/lpfc_debugfs.h:405:15: error: 'DUMP_FCP' undeclared (first 
use in this function); did you mean 'DUMP_VAR'?
drivers/scsi/lpfc/lpfc_debugfs.h:405:15: note: each undeclared identifier is 
reported only once for each function it appears in
drivers/scsi/lpfc/lpfc_debugfs.h:408:22: error: 'DUMP_NVME' undeclared (first 
use in this function); did you mean 'DUMP_NONE'?
drivers/scsi/lpfc/lpfc_nvmet.c: In function 'lpfc_nvmet_xmt_ls_rsp_cmp':
drivers/scsi/lpfc/lpfc_nvmet.c:109:2: error: implicit declaration of function 
'lpfc_nvmeio_data'; did you mean 'lpfc_mem_free'? 
[-Werror=implicit-function-declaration]
drivers/scsi/lpfc/lpfc_nvmet.c: In function 'lpfc_nvmet_xmt_fcp_op':
drivers/scsi/lpfc/lpfc_nvmet.c:523:10: error: unused variable 'id' 
[-Werror=unused-variable]

They are all trivial to fix, so I'm doing it in a combined patch here.

Fixes: 1d9d5a9879ad ("scsi: lpfc: refactor debugfs queue dump routines")
Fixes: bd2cdd5e400f ("scsi: lpfc: NVME Initiator: Add debugfs support")
Fixes: 2b65e18202fd ("scsi: lpfc: NVME Target: Add debugfs support")
Signed-off-by: Arnd Bergmann 
---
 drivers/scsi/lpfc/lpfc_debugfs.h | 22 ++
 drivers/scsi/lpfc/lpfc_nvmet.c   |  4 ++--
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_debugfs.h b/drivers/scsi/lpfc/lpfc_debugfs.h
index c05f56c3023f..7b7d314af0e0 100644
--- a/drivers/scsi/lpfc/lpfc_debugfs.h
+++ b/drivers/scsi/lpfc/lpfc_debugfs.h
@@ -44,14 +44,6 @@
 /* hbqinfo output buffer size */
 #define LPFC_HBQINFO_SIZE 8192
 
-enum {
-   DUMP_FCP,
-   DUMP_NVME,
-   DUMP_MBX,
-   DUMP_ELS,
-   DUMP_NVMELS,
-};
-
 /* nvmestat output buffer size */
 #define LPFC_NVMESTAT_SIZE 8192
 #define LPFC_NVMEKTIME_SIZE 8192
@@ -283,8 +275,22 @@ struct lpfc_idiag {
struct lpfc_idiag_offset offset;
void *ptr_private;
 };
+
+#else
+
+#define lpfc_nvmeio_data(phba, fmt, arg...) \
+   no_printk(fmt, ##arg)
+
 #endif
 
+enum {
+   DUMP_FCP,
+   DUMP_NVME,
+   DUMP_MBX,
+   DUMP_ELS,
+   DUMP_NVMELS,
+};
+
 /* Mask for discovery_trace */
 #define LPFC_DISC_TRC_ELS_CMD  0x1 /* Trace ELS commands */
 #define LPFC_DISC_TRC_ELS_RSP  0x2 /* Trace ELS response */
diff --git a/drivers/scsi/lpfc/lpfc_nvmet.c b/drivers/scsi/lpfc/lpfc_nvmet.c
index 7ca868f394da..acba1b67e505 100644
--- a/drivers/scsi/lpfc/lpfc_nvmet.c
+++ b/drivers/scsi/lpfc/lpfc_nvmet.c
@@ -520,7 +520,7 @@ lpfc_nvmet_xmt_fcp_op(struct nvmet_fc_target_port *tgtport,
struct lpfc_hba *phba = ctxp->phba;
struct lpfc_iocbq *nvmewqeq;
unsigned long iflags;
-   int rc, id;
+   int rc;
 
 #ifdef CONFIG_SCSI_LPFC_DEBUG_FS
if (phba->ktime_on) {
@@ -530,7 +530,7 @@ lpfc_nvmet_xmt_fcp_op(struct nvmet_fc_target_port *tgtport,
ctxp->ts_nvme_data = ktime_get_ns();
}
if (phba->cpucheck_on & LPFC_CHECK_NVMET_IO) {
-   id = smp_processor_id();
+   int id = smp_processor_id();
ctxp->cpu = id;
if (id < LPFC_CHECK_CPU_CNT)
phba->cpucheck_xmt_io[id]++;
-- 
2.9.0

Re: [PATCH 06/23] dm-kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-03-23 Thread Mike Snitzer

On Thu, Mar 23 2017 at 10:33am -0400,
Christoph Hellwig  wrote:

> It seems like the code currently passes whatever it was using for writes
> to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
> need any payload.
> 
> Untested, and confused by the code, maybe someone who understands it
> better than me can help..
> 
> Not-yet-signed-off-by: Christoph Hellwig 

See commit 70d6c400a ("dm kcopyd: add WRITE SAME support to dm_kcopyd_zero")
drivers/md/dm-io.c:do_region() adjusts the WRITE SAME payload to be a
single page.

So you'd want to tweak dm-io.c accordingly for WRITE ZEROES (presummably
no payload?)

Mike

> ---
>  drivers/md/dm-kcopyd.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 9e9d04cb7d51..f85846741d50 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct 
> dm_io_region *from,
>   job->pages = &zero_page_list;
>  
>   /*
> -  * Use WRITE SAME to optimize zeroing if all dests support it.
> +  * Use WRITE ZEROES to optimize zeroing if all dests support it.
>*/
> - job->rw = REQ_OP_WRITE_SAME;
> + job->rw = REQ_OP_WRITE_ZEROES;
>   for (i = 0; i < job->num_dests; i++)
> - if (!bdev_write_same(job->dests[i].bdev)) {
> + if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) {
>   job->rw = WRITE;
>   break;
>   }
> -- 
> 2.11.0
>

[PATCH 16/23] brd: remove discard support

2017-03-23 Thread Christoph Hellwig

It's just a in-driver reimplementation of writing zeroes to the pages,
which fails if the discards aren't page aligned.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/brd.c | 56 -
 1 file changed, 56 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..c15d0581531f 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -134,28 +134,6 @@ static struct page *brd_insert_page(struct brd_device 
*brd, sector_t sector)
return page;
 }
 
-static void brd_free_page(struct brd_device *brd, sector_t sector)
-{
-   struct page *page;
-   pgoff_t idx;
-
-   spin_lock(&brd->brd_lock);
-   idx = sector >> PAGE_SECTORS_SHIFT;
-   page = radix_tree_delete(&brd->brd_pages, idx);
-   spin_unlock(&brd->brd_lock);
-   if (page)
-   __free_page(page);
-}
-
-static void brd_zero_page(struct brd_device *brd, sector_t sector)
-{
-   struct page *page;
-
-   page = brd_lookup_page(brd, sector);
-   if (page)
-   clear_highpage(page);
-}
-
 /*
  * Free all backing store pages and radix tree. This must only be called when
  * there are no other users of the device.
@@ -212,24 +190,6 @@ static int copy_to_brd_setup(struct brd_device *brd, 
sector_t sector, size_t n)
return 0;
 }
 
-static void discard_from_brd(struct brd_device *brd,
-   sector_t sector, size_t n)
-{
-   while (n >= PAGE_SIZE) {
-   /*
-* Don't want to actually discard pages here because
-* re-allocating the pages can result in writeback
-* deadlocks under heavy load.
-*/
-   if (0)
-   brd_free_page(brd, sector);
-   else
-   brd_zero_page(brd, sector);
-   sector += PAGE_SIZE >> SECTOR_SHIFT;
-   n -= PAGE_SIZE;
-   }
-}
-
 /*
  * Copy n bytes from src to the brd starting at sector. Does not sleep.
  */
@@ -338,14 +298,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, 
struct bio *bio)
if (bio_end_sector(bio) > get_capacity(bdev->bd_disk))
goto io_error;
 
-   if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
-   if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) ||
-   bio->bi_iter.bi_size & ~PAGE_MASK)
-   goto io_error;
-   discard_from_brd(brd, sector, bio->bi_iter.bi_size);
-   goto out;
-   }
-
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
int err;
@@ -357,9 +309,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, 
struct bio *bio)
sector += len >> SECTOR_SHIFT;
}
 
-out:
-   bio_endio(bio);
-   return BLK_QC_T_NONE;
 io_error:
bio_io_error(bio);
return BLK_QC_T_NONE;
@@ -464,11 +413,6 @@ static struct brd_device *brd_alloc(int i)
 *  is harmless)
 */
blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);
-
-   brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
-   blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
-   brd->brd_queue->limits.discard_zeroes_data = 1;
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.11.0

Re: support ranges TRIM for libata

2017-03-23 Thread Christoph Hellwig

On Thu, Mar 23, 2017 at 10:35:06AM -0400, James Bottomley wrote:
> I'm certainly not saying we blindly follow t10, but I believe their
> intent is to issue the next command from the completion of the first
> (we can do this using qc->complete_fn, like atapi_request_sense).  That
> way we don't get any tag problems because there's only one command
> outstanding at once; reusing the qc means no allocation issues either.
> 
> The t10 approach does mean the SG_IO problem is actually fixable rather
> than simply erroring out.

It would be sort of fixable, but with a lot of hackery.

> That's up to you ... from the point of view of code documenting itself,
> forming the ATA_16 TRIM in sd and not doing any satl transformation is
> easier for others to follow, but if it's going to cause more code, I'm
> only marginal on the advantages of easier to follow code.

I tried this earlier before giving up on it because it looked to ugly.
But I can complete that version of it and post it for people to compare.

[PATCH 10/23] block: add a new BLKDEV_ZERO_NOFALLBACK flag

2017-03-23 Thread Christoph Hellwig

This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.

Also clean up the convoluted check for the return condition that this
new flag is added to.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c| 5 -
 include/linux/blkdev.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index a59bb54ac7c7..33c5bf373b7f 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -281,6 +281,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
  *
  *  If a device is using logical block provisioning, the underlying space will
  *  only be release if %flags contains BLKDEV_ZERO_UNMAP.
+ *
+ *  If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return
+ *  -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -298,7 +301,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop, flags);
-   if (ret == 0 || (ret && ret != -EOPNOTSUPP))
+   if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
goto out;
 
ret = 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 19209416225d..1bddcfc631a4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1347,6 +1347,7 @@ extern int __blkdev_issue_discard(struct block_device 
*bdev, sector_t sector,
struct bio **biop);
 
 #define BLKDEV_ZERO_UNMAP  (1 << 0)  /* free unused blocks */
+#define BLKDEV_ZERO_NOFALLBACK (1 << 1)  /* don't write explicit zeroes */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-- 
2.11.0

[PATCH 07/23] block: stop using blkdev_issue_write_same for zeroing

2017-03-23 Thread Christoph Hellwig

We'll always use the WRITE ZEROES code for zeroing now.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index ed1e78e24db0..a93115ccf461 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -364,10 +364,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return 0;
}
 
-   if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
-   ZERO_PAGE(0)))
-   return 0;
-
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
&bio, discard);
-- 
2.11.0

[PATCH 04/23] md: support REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/linear.c| 1 +
 drivers/md/md.h| 7 +++
 drivers/md/multipath.c | 1 +
 drivers/md/raid0.c | 2 ++
 drivers/md/raid1.c | 4 +++-
 drivers/md/raid10.c| 1 +
 drivers/md/raid5.c | 1 +
 7 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 3e38e0207a3e..377a8a3672e3 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -293,6 +293,7 @@ static void linear_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1e76d64ce180 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -709,4 +709,11 @@ static inline void mddev_check_writesame(struct mddev 
*mddev, struct bio *bio)
!bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
mddev->queue->limits.max_write_same_sectors = 0;
 }
+
+static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio 
*bio)
+{
+   if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
+   !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
+   mddev->queue->limits.max_write_zeroes_sectors = 0;
+}
 #endif /* _MD_MD_H */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 79a12b59250b..e95d521d93e9 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -139,6 +139,7 @@ static void multipath_make_request(struct mddev *mddev, 
struct bio * bio)
mp_bh->bio.bi_end_io = multipath_end_request;
mp_bh->bio.bi_private = mp_bh;
mddev_check_writesame(mddev, &mp_bh->bio);
+   mddev_check_write_zeroes(mddev, &mp_bh->bio);
generic_make_request(&mp_bh->bio);
return;
 }
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 93347ca7c7a6..ce7a6a56cf73 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -383,6 +383,7 @@ static int raid0_run(struct mddev *mddev)
 
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 
mddev->chunk_sectors);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_discard_sectors(mddev->queue, 
mddev->chunk_sectors);
 
blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
@@ -504,6 +505,7 @@ static void raid0_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a34f58772022..b59cc100320a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3177,8 +3177,10 @@ static int raid1_run(struct mddev *mddev)
if (IS_ERR(conf))
return PTR_ERR(conf);
 
-   if (mddev->queue)
+   if (mddev->queue) {
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+   }
 
rdev_for_each(rdev, mddev) {
if (!mddev->gendisk)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e89a8d78a9ed..28ec3a93acee 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3749,6 +3749,7 @@ static int raid10_run(struct mddev *mddev)
blk_queue_max_discard_sectors(mddev->queue,
  mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
blk_queue_io_min(mddev->queue, chunk_size);
if (conf->geo.raid_disks % conf->geo.near_copies)
blk_queue_io_opt(mddev->queue, chunk_size * 
conf->geo.raid_disks);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed5cd705b985..8cf1f86dcd05 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7272,6 +7272,7 @@ static int raid5_run(struct mddev *mddev)
mddev->queue->limits.discard_zeroes_data = 0;
 
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 
rdev_for_each(rdev, mddev) {
disk_s

[PATCH 15/23] loop: implement REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

It's identical to discard as hole punches will always leave us with
zeroes on reads.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ecb6461ed81..265cd2e33ff0 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -528,6 +528,7 @@ static int do_req_filebacked(struct loop_device *lo, struct 
request *rq)
case REQ_OP_FLUSH:
return lo_req_flush(lo, rq);
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
return lo_discard(lo, rq, pos);
case REQ_OP_WRITE:
if (lo->transfer)
@@ -826,6 +827,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = 0;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, 0);
+   blk_queue_max_write_zeroes_sectors(q, 0);
q->limits.discard_zeroes_data = 0;
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
return;
@@ -834,6 +836,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = inode->i_sb->s_blocksize;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+   blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
q->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -1660,6 +1663,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
switch (req_op(cmd->rq)) {
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
cmd->use_aio = false;
break;
default:
-- 
2.11.0

[PATCH 12/23] sd: handle REQ_UNMAP

2017-03-23 Thread Christoph Hellwig

Try to use a write same with unmap bit variant if the device supports it
and the caller asks for it.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b6f70a09a301..ca96bb33471b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -871,6 +871,16 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd 
*cmd)
return BLKPREP_INVALID;
return sd_setup_ata_trim_cmnd(cmd);
}
+
+   if (rq->cmd_flags & REQ_UNMAP) {
+   switch (sdkp->provisioning_mode) {
+   case SD_LBP_WS16:
+   return sd_setup_write_same16_cmnd(cmd, true);
+   case SD_LBP_WS10:
+   return sd_setup_write_same10_cmnd(cmd, true);
+   }
+   }
+
if (sdp->no_write_same)
return BLKPREP_INVALID;
if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
-- 
2.11.0

[PATCH 05/23] dm: support REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-core.h  |  1 +
 drivers/md/dm-io.c| 10 +++---
 drivers/md/dm-linear.c|  1 +
 drivers/md/dm-mpath.c |  1 +
 drivers/md/dm-rq.c| 11 ---
 drivers/md/dm-stripe.c|  2 ++
 drivers/md/dm-table.c | 30 ++
 drivers/md/dm.c   | 31 ---
 include/linux/device-mapper.h |  6 ++
 9 files changed, 84 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..fea5bd52ada8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -132,6 +132,7 @@ void dm_init_md_queue(struct mapped_device *md);
 void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
+void disable_write_zeroes(struct mapped_device *md);
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
 {
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..fad2d5d1888e 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -312,9 +312,12 @@ static void do_region(int op, int op_flags, unsigned 
region,
 */
if (op == REQ_OP_DISCARD)
special_cmd_max_sectors = q->limits.max_discard_sectors;
+   else if (op == REQ_OP_WRITE_ZEROES)
+   special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
else if (op == REQ_OP_WRITE_SAME)
special_cmd_max_sectors = q->limits.max_write_same_sectors;
-   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_SAME) &&
+   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
+op == REQ_OP_WRITE_SAME)  &&
special_cmd_max_sectors == 0) {
dec_count(io, region, -EOPNOTSUPP);
return;
@@ -328,7 +331,8 @@ static void do_region(int op, int op_flags, unsigned region,
/*
 * Allocate a suitably sized-bio.
 */
-   if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_SAME))
+   if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_ZEROES) ||
+   (op == REQ_OP_WRITE_SAME))
num_bvecs = 1;
else
num_bvecs = min_t(int, BIO_MAX_PAGES,
@@ -341,7 +345,7 @@ static void do_region(int op, int op_flags, unsigned region,
bio_set_op_attrs(bio, op, op_flags);
store_io_and_region_in_bio(bio, io, region);
 
-   if (op == REQ_OP_DISCARD) {
+   if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) {
num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
remaining -= num_sectors;
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..e17fd44ceef5 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -59,6 +59,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
ti->private = lc;
return 0;
 
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7f223dbed49f..ab55955ed704 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1103,6 +1103,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
if (m->queue_mode == DM_TYPE_BIO_BASED)
ti->per_io_data_size = multipath_per_bio_data_size();
else
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 28955b94d2b2..6006d5d4b06e 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -298,9 +298,14 @@ static void dm_done(struct request *clone, int error, bool 
mapped)
r = rq_end_io(tio->ti, clone, error, &tio->info);
}
 
-   if (unlikely(r == -EREMOTEIO && (req_op(clone) == REQ_OP_WRITE_SAME) &&
-!clone->q->limits.max_write_same_sectors))
-   disable_write_same(tio->md);
+   if (unlikely(r == -EREMOTEIO)) {
+   if (req_op(clone) == REQ_OP_WRITE_SAME &&
+   !clone->q->limits.max_write_same_sectors)
+   disable_write_same(tio->md);
+   if (req_op(clone) == REQ_OP_WRITE_ZEROES &&
+   !clone->q->limits.max_write_zeroes_sectors)
+   disable_write_zeroes(tio->md);
+   }
 
if (r <= 0)
/* The target wants to complete the I/O */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-strip

[PATCH 11/23] block_dev: use blkdev_issue_zerout for hole punches

2017-03-23 Thread Christoph Hellwig

This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and preparse for removing the discard_zeroes_data flag.

Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.

Signed-off-by: Christoph Hellwig 
---
 fs/block_dev.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6128283b7830..5c701c16e8ff 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2074,10 +2074,10 @@ static long blkdev_fallocate(struct file *file, int 
mode, loff_t start,
 loff_t len)
 {
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
-   struct request_queue *q = bdev_get_queue(bdev);
struct address_space *mapping;
loff_t end = start + len - 1;
loff_t isize;
+   unsigned flags = 0;
int error;
 
/* Fail if we don't recognize the flags. */
@@ -2107,21 +2107,15 @@ static long blkdev_fallocate(struct file *file, int 
mode, loff_t start,
truncate_inode_pages_range(mapping, start, end);
 
switch (mode) {
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
+   flags = BLKDEV_ZERO_NOFALLBACK;
+   /*FALLTHRU*/
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
-   GFP_KERNEL, 0);
-   break;
-   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-   /* Only punch if the device can do zeroing discard. */
-   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
-   return -EOPNOTSUPP;
-   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
-GFP_KERNEL, 0);
+GFP_KERNEL, flags);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
-   if (!blk_queue_discard(q))
-   return -EOPNOTSUPP;
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 0);
break;
-- 
2.11.0

[PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 45 -
 drivers/scsi/sd_zbc.c |  1 +
 2 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index af632e350ab4..b6f70a09a301 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -748,7 +748,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
return scsi_init_io(cmd);
 }
 
-static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap)
 {
struct scsi_device *sdp = cmd->device;
struct request *rq = cmd->request;
@@ -765,13 +765,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd 
*cmd)
 
cmd->cmd_len = 16;
cmd->cmnd[0] = WRITE_SAME_16;
-   cmd->cmnd[1] = 0x8; /* UNMAP */
+   if (unmap)
+   cmd->cmnd[1] = 0x8; /* UNMAP */
put_unaligned_be64(sector, &cmd->cmnd[2]);
put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
@@ -801,7 +802,7 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd 
*cmd, bool unmap)
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
@@ -857,11 +858,39 @@ static int sd_setup_ata_trim_cmnd(struct scsi_cmnd *cmd)
return scsi_init_io(cmd);
 }
 
+static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
+{
+   struct request *rq = cmd->request;
+   struct scsi_device *sdp = cmd->device;
+   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+
+   if (sdp->ata_trim) {
+   if (!sdp->ata_trim_zeroes_data)
+   return BLKPREP_INVALID;
+   return sd_setup_ata_trim_cmnd(cmd);
+   }
+   if (sdp->no_write_same)
+   return BLKPREP_INVALID;
+   if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
+   return sd_setup_write_same16_cmnd(cmd, false);
+   return sd_setup_write_same10_cmnd(cmd, false);
+}
+
 static void sd_config_write_same(struct scsi_disk *sdkp)
 {
struct request_queue *q = sdkp->disk->queue;
unsigned int logical_block_size = sdkp->device->sector_size;
 
+   if (sdkp->device->ata_trim) {
+   if (sdkp->device->ata_trim_zeroes_data)
+   sdkp->max_ws_blocks = 65535 * (512 / sizeof(__le64));
+   else
+   sdkp->max_ws_blocks = 0;
+   goto config_write_zeroes;
+   }
+
if (sdkp->device->no_write_same) {
sdkp->max_ws_blocks = 0;
goto out;
@@ -886,6 +915,9 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 out:
blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
 (logical_block_size >> 9));
+config_write_zeroes:
+   blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks *
+(logical_block_size >> 9));
 }
 
 /**
@@ -1226,7 +1258,7 @@ static int sd_init_command(struct scsi_cmnd *cmd)
case SD_LBP_UNMAP:
return sd_setup_unmap_cmnd(cmd);
case SD_LBP_WS16:
-   return sd_setup_write_same16_cmnd(cmd);
+   return sd_setup_write_same16_cmnd(cmd, true);
case SD_LBP_WS10:
return sd_setup_write_same10_cmnd(cmd, true);
case SD_LBP_ZERO:
@@ -1236,6 +1268,8 @@ static int sd_init_command(struct scsi_cmnd *cmd)
default:
return BLKPREP_INVALID;
}
+   case REQ_OP_WRITE_ZEROES:
+   return sd_setup_write_zeroes_cmnd(cmd);
case REQ_OP_WRITE_SAME:
return sd_setup_write_same_cmnd(cmd);
case REQ_OP_FLUSH:
@@ -1876,6 +1910,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
switch (req_op(req)) {
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
if (!result) {
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 92620c8ea8ad..1994f7799fce 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -329,6 +329,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
 
switch (req_op(rq)) {
case REQ_OP_WRITE:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:

[PATCH 18/23] rsxx: remove the discard_zeroes_data flag

2017-03-23 Thread Christoph Hellwig

rsxx only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rsxx/dev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..9c566364ac9c 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -300,7 +300,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
RSXX_HW_BLK_SIZE >> 9);
card->queue->limits.discard_granularity = RSXX_HW_BLK_SIZE;
card->queue->limits.discard_alignment   = RSXX_HW_BLK_SIZE;
-   card->queue->limits.discard_zeroes_data = 1;
}
 
card->queue->queuedata = card;
-- 
2.11.0

[PATCH 09/23] block: add a REQ_UNMAP flag for REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

If this flag is set logical provisioning capable device should
release space for the zeroed blocks if possible, if it is not set
devices should keep the blocks anchored.

Also remove an out of sync kerneldoc comment for a static function
that would have become even more out of data with this change.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c   | 19 +--
 include/linux/blk_types.h |  5 +
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 1e849f904d4c..a59bb54ac7c7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -226,20 +226,9 @@ int blkdev_issue_write_same(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
 
-/**
- * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES
- * @bdev:  blockdev to issue
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @biop:  pointer to anchor bio
- *
- * Description:
- *  Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled 
pages.
- */
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
-   struct bio **biop)
+   struct bio **biop, unsigned flags)
 {
struct bio *bio = *biop;
unsigned int max_write_zeroes_sectors;
@@ -258,7 +247,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
bio = next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
-   bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
+   bio->bi_opf = REQ_OP_WRITE_ZEROES;
+   if (flags & BLKDEV_ZERO_UNMAP)
+   bio->bi_opf |= REQ_UNMAP;
 
if (nr_sects > max_write_zeroes_sectors) {
bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -306,7 +297,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return -EINVAL;
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
-   biop);
+   biop, flags);
if (ret == 0 || (ret && ret != -EOPNOTSUPP))
goto out;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6393c13a6498..4da61fd12c94 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -187,6 +187,10 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
__REQ_BACKGROUND,   /* background IO */
+
+   /* command specific flags for REQ_OP_WRITE_ZEROES: */
+   __REQ_UNMAP,/* explicitly release space when zeroing */
+
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -203,6 +207,7 @@ enum req_flag_bits {
 #define REQ_PREFLUSH   (1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
+#define REQ_UNMAP  (1ULL << __REQ_UNMAP)
 
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.11.0

[PATCH 08/23] block: add a flags argument to (__)blkdev_issue_zeroout

2017-03-23 Thread Christoph Hellwig

Turn the existin discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c  | 31 ++-
 block/ioctl.c|  3 +--
 drivers/nvme/target/io-cmd.c |  2 +-
 fs/block_dev.c   |  2 +-
 fs/dax.c |  2 +-
 fs/xfs/xfs_bmap_util.c   |  2 +-
 include/linux/blkdev.h   | 14 +-
 7 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index a93115ccf461..1e849f904d4c 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -282,14 +282,18 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
  * @biop:  pointer to anchor bio
- * @discard:   discard flag
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.
+ *
+ *  If a device is using logical block provisioning, the underlying space will
+ *  only be release if %flags contains BLKDEV_ZERO_UNMAP.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-   bool discard)
+   unsigned flags)
 {
int ret;
int bi_size = 0;
@@ -337,28 +341,21 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout);
  * @sector:start sector
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @discard:   whether to discard the block range
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Zero-fill a block range.  If the discard flag is set and the block
- *  device guarantees that subsequent READ operations to the block range
- *  in question will return zeroes, the blocks will be discarded. Should
- *  the discard request fail, if the discard flag is not set, or if
- *  discard_zeroes_data is not supported, this function will resort to
- *  zeroing the blocks manually, thus provisioning (allocating,
- *  anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME
- *  command(s), blkdev_issue_zeroout() will use it to optimize the process of
- *  clearing the block range. Otherwise the zeroing will be performed
- *  using regular WRITE calls.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.  See __blkdev_issue_zeroout() for the
+ *  valid values for %flags.
  */
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-sector_t nr_sects, gfp_t gfp_mask, bool discard)
+   sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
int ret;
struct bio *bio = NULL;
struct blk_plug plug;
 
-   if (discard) {
+   if (flags & BLKDEV_ZERO_UNMAP) {
if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
BLKDEV_DISCARD_ZERO))
return 0;
@@ -366,7 +363,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
-   &bio, discard);
+   &bio, flags);
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
diff --git a/block/ioctl.c b/block/ioctl.c
index 7b88820b93d9..304685022d9f 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -254,8 +254,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
fmode_t mode,
mapping = bdev->bd_inode->i_mapping;
truncate_inode_pages_range(mapping, start, end);
 
-   return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
-   false);
+   return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL, 0);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 4195115c7e54..1366babaee50 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -184,7 +184,7 @@ static void nvmet_execute_write_zeroes(struct nvmet_req 
*req)
(req->ns->blksize_shift - 9)) + 1;
 
if (__blkdev_issue_zeroout(req->ns->bdev, sector, nr_sector,
-   GFP_KERNEL, &bio, true))
+   GFP_KERNEL, &bio, BLKDEV_ZERO_UNMAP))
status = NVME_SC_INTERNAL | NVME_SC_DNR;
 
if (bio) {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..6128283b7830 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2110,7 +2110,7 @@ static long blkdev_fallocate(struct file *file, int mod

[PATCH 13/23] nvme: implement REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

But now for the real NVMe Write Zeroes yet, just to get rid of the
discard abuse for zeroing.  Also rename the quirk flag to be a bit
more self-explanatory.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 10 +-
 drivers/nvme/host/nvme.h |  6 +++---
 drivers/nvme/host/pci.c  |  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9b3b57fef446..bfa0595dc767 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -335,6 +335,8 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
case REQ_OP_FLUSH:
nvme_setup_flush(ns, cmd);
break;
+   case REQ_OP_WRITE_ZEROES:
+   /* currently only aliased to deallocate for a few ctrls: */
case REQ_OP_DISCARD:
ret = nvme_setup_discard(ns, req, cmd);
break;
@@ -900,16 +902,14 @@ static void nvme_config_discard(struct nvme_ns *ns)
BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
NVME_DSM_MAX_RANGES);
 
-   if (ctrl->quirks & NVME_QUIRK_DISCARD_ZEROES)
-   ns->queue->limits.discard_zeroes_data = 1;
-   else
-   ns->queue->limits.discard_zeroes_data = 0;
-
ns->queue->limits.discard_alignment = logical_block_size;
ns->queue->limits.discard_granularity = logical_block_size;
blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+
+   if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
+   blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
 }
 
 static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2aa20e3e5675..07ebc4a1c8fc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -68,10 +68,10 @@ enum nvme_quirks {
NVME_QUIRK_IDENTIFY_CNS = (1 << 1),
 
/*
-* The controller deterministically returns O's on reads to discarded
-* logical blocks.
+* The controller deterministically returns O's on reads to
+* logical blocks that deallocate was called on.
 */
-   NVME_QUIRK_DISCARD_ZEROES   = (1 << 2),
+   NVME_QUIRK_DEALLOCATE_ZEROES= (1 << 2),
 
/*
 * The controller needs a delay before starts checking the device
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 26a5fd05fe88..0a28787267f0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2135,13 +2135,13 @@ static const struct pci_error_handlers nvme_err_handler 
= {
 static const struct pci_device_id nvme_id_table[] = {
{ PCI_VDEVICE(INTEL, 0x0953),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a53),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a54),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x5845),   /* Qemu emulated controller */
.driver_data = NVME_QUIRK_IDENTIFY_CNS, },
{ PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
-- 
2.11.0

[PATCH 20/23] block: stop using discards for zeroing

2017-03-23 Thread Christoph Hellwig

Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that
support efficient zeroing of devices we can remove the call to
blkdev_issue_discard.  This means we only have two ways of zeroing left
and can simply the code.

Signed-off-by: Christoph Hellwig 
---
 block/blk-lib.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 33c5bf373b7f..153ca59393a7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -279,6 +279,11 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
  *  Zero-fill a block range, either using hardware offload or by explicitly
  *  writing zeroes to the device.
  *
+ *  Note that this function may fail with -EOPNOTSUPP if the driver supports
+ *  efficient zeroing operation, but the device capabilities can only be
+ *  discovered by trial and error.  In this case the caller should call the
+ *  function again, and it will use the fallback path.
+ *
  *  If a device is using logical block provisioning, the underlying space will
  *  only be release if %flags contains BLKDEV_ZERO_UNMAP.
  *
@@ -349,12 +354,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
struct bio *bio = NULL;
struct blk_plug plug;
 
-   if (flags & BLKDEV_ZERO_UNMAP) {
-   if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
-   BLKDEV_DISCARD_ZERO))
-   return 0;
-   }
-
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
&bio, flags);
-- 
2.11.0

[PATCH 14/23] zram: implement REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

Just the same as discard if the block size equals the system page size.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/zram/zram_drv.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dceb5edd1e54..1710b06f04a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -829,10 +829,14 @@ static void __zram_make_request(struct zram *zram, struct 
bio *bio)
offset = (bio->bi_iter.bi_sector &
  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
 
-   if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, index, offset, bio);
bio_endio(bio);
return;
+   default:
+   break;
}
 
bio_for_each_segment(bvec, bio, iter) {
@@ -1192,6 +1196,8 @@ static int zram_add(void)
zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE;
zram->disk->queue->limits.chunk_sectors = 0;
blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX);
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+
/*
 * zram_bio_discard() will clear all logical blocks if logical block
 * size is identical with physical block size(PAGE_SIZE). But if it is
@@ -1201,10 +1207,7 @@ static int zram_add(void)
 * zeroed.
 */
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
-   zram->disk->queue->limits.discard_zeroes_data = 1;
-   else
-   zram->disk->queue->limits.discard_zeroes_data = 0;
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+   blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
add_disk(zram->disk);
 
-- 
2.11.0

[PATCH 21/23] drbd: make intelligent use of blkdev_issue_zeroout

2017-03-23 Thread Christoph Hellwig

drbd always wants its discard wire operations to zero the blocks, so
use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
reinventing it poorly.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_debugfs.c  |  3 --
 drivers/block/drbd/drbd_int.h  |  6 ---
 drivers/block/drbd/drbd_receiver.c | 99 ++
 drivers/block/drbd/drbd_req.c  |  6 +--
 4 files changed, 7 insertions(+), 107 deletions(-)

diff --git a/drivers/block/drbd/drbd_debugfs.c 
b/drivers/block/drbd/drbd_debugfs.c
index de5c3ee8a790..494837e59f23 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file 
*m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, &sep, "in-AL");
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, &sep, "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, &sep, "set-in-sync");
-
-   if (f & EE_IS_TRIM)
-   __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, &sep, 
"zero-out", "trim");
seq_print_rq_state_bit(m, f & EE_WRITE_SAME, &sep, "write-same");
seq_putc(m, '\n');
 }
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 724d1c50fc52..d5da45bb03a6 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -437,9 +437,6 @@ enum {
 
/* is this a TRIM aka REQ_DISCARD? */
__EE_IS_TRIM,
-   /* our lower level cannot handle trim,
-* and we want to fall back to zeroout instead */
-   __EE_IS_TRIM_USE_ZEROOUT,
 
/* In case a barrier failed,
 * we need to resubmit without the barrier flag. */
@@ -482,7 +479,6 @@ enum {
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
 #define EE_IS_TRIM (1<<__EE_IS_TRIM)
-#define EE_IS_TRIM_USE_ZEROOUT (1<<__EE_IS_TRIM_USE_ZEROOUT)
 #define EE_RESUBMITTED (1<<__EE_RESUBMITTED)
 #define EE_WAS_ERROR   (1<<__EE_WAS_ERROR)
 #define EE_HAS_DIGEST  (1<<__EE_HAS_DIGEST)
@@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data);
 extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
 
 /* drbd_receiver.c */
-extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
-   sector_t start, unsigned int nr_sectors, bool discard);
 extern int drbd_receiver(struct drbd_thread *thi);
 extern int drbd_ack_receiver(struct drbd_thread *thi);
 extern void drbd_send_ping_wf(struct work_struct *ws);
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index aa6bf9692eff..8641776f043a 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1448,105 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource 
*resource, struct drbd_backin
drbd_info(resource, "Method to ensure write ordering: %s\n", 
write_ordering_str[resource->write_ordering]);
 }
 
-/*
- * We *may* ignore the discard-zeroes-data setting, if so configured.
- *
- * Assumption is that it "discard_zeroes_data=0" is only because the backend
- * may ignore partial unaligned discards.
- *
- * LVM/DM thin as of at least
- *   LVM version: 2.02.115(2)-RHEL7 (2015-01-28)
- *   Library version: 1.02.93-RHEL7 (2015-01-28)
- *   Driver version:  4.29.0
- * still behaves this way.
- *
- * For unaligned (wrt. alignment and granularity) or too small discards,
- * we zero-out the initial (and/or) trailing unaligned partial chunks,
- * but discard all the aligned full chunks.
- *
- * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
- */
-int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, 
unsigned int nr_sectors, bool discard)
-{
-   struct block_device *bdev = device->ldev->backing_bdev;
-   struct request_queue *q = bdev_get_queue(bdev);
-   sector_t tmp, nr;
-   unsigned int max_discard_sectors, granularity;
-   int alignment;
-   int err = 0;
-
-   if (!discard)
-   goto zero_out;
-
-   /* Zero-sector (unknown) and one-sector granularities are the same.  */
-   granularity = max(q->limits.discard_granularity >> 9, 1U);
-   alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
-   max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22));
-   max_discard_sectors -= max_discard_sectors % granularity;
-   if (unlikely(!max_discard_sectors))
-   goto zero_out;
-
-   if (nr_sectors < granularity)
-   goto zero_out;
-
-   tmp = start;
-   if (sector_div(tmp, granularity) != alignment) {
-   if (nr_sectors < 2*granularity)
-   goto zero_out;
-   /* start + gran - (start + gran - align) % gran */
-   tmp = start + granularity -

[PATCH 23/23] block: remove the discard_zeroes_data flag

2017-03-23 Thread Christoph Hellwig

Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.

Signed-off-by: Christoph Hellwig 
---
 Documentation/ABI/testing/sysfs-block | 10 ++
 Documentation/block/queue-sysfs.txt   |  5 -
 block/blk-lib.c   |  7 +--
 block/blk-settings.c  |  3 ---
 block/blk-sysfs.c |  2 +-
 block/compat_ioctl.c  |  2 +-
 block/ioctl.c |  2 +-
 drivers/ata/libata-scsi.c |  2 +-
 drivers/block/drbd/drbd_main.c|  2 --
 drivers/block/drbd/drbd_nl.c  |  7 +--
 drivers/block/loop.c  |  2 --
 drivers/block/mtip32xx/mtip32xx.c |  1 -
 drivers/block/nbd.c   |  1 -
 drivers/md/dm-cache-target.c  |  1 -
 drivers/md/dm-crypt.c |  1 -
 drivers/md/dm-raid.c  |  6 +++---
 drivers/md/dm-raid1.c |  1 -
 drivers/md/dm-table.c | 19 ---
 drivers/md/dm-thin.c  |  2 --
 drivers/md/raid5.c| 12 +---
 drivers/scsi/sd.c |  7 ---
 drivers/target/target_core_device.c   |  2 +-
 include/linux/blkdev.h| 15 ---
 include/linux/device-mapper.h |  5 -
 24 files changed, 13 insertions(+), 104 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 2da04ce6aeef..dea212db9df3 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -213,14 +213,8 @@ What:  
/sys/block//queue/discard_zeroes_data
 Date:  May 2011
 Contact:   Martin K. Petersen 
 Description:
-   Devices that support discard functionality may return
-   stale or random data when a previously discarded block
-   is read back. This can cause problems if the filesystem
-   expects discarded blocks to be explicitly cleared. If a
-   device reports that it deterministically returns zeroes
-   when a discarded area is read the discard_zeroes_data
-   parameter will be set to one. Otherwise it will be 0 and
-   the result of reading a discarded area is undefined.
+   Will always return 0.  Don't rely on any specific behavior
+   for discards, and don't read this file.
 
 What:  /sys/block//queue/write_same_max_bytes
 Date:  January 2012
diff --git a/Documentation/block/queue-sysfs.txt 
b/Documentation/block/queue-sysfs.txt
index c0a3bb5a6e4e..009150ed7db8 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -43,11 +43,6 @@ large discards are issued, setting this value lower will 
make Linux issue
 smaller discards and potentially help reduce latencies induced by large
 discard operations.
 
-discard_zeroes_data (RO)
-
-When read, this file will show if the discarded block are zeroed by the
-device or not. If its value is '1' the blocks are zeroed otherwise not.
-
 hw_sector_size (RO)
 ---
 This is the hardware sector size of the device, in bytes.
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 153ca59393a7..be44d2725ede 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -37,17 +37,12 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
return -ENXIO;
 
if (flags & BLKDEV_DISCARD_SECURE) {
-   if (flags & BLKDEV_DISCARD_ZERO)
-   return -EOPNOTSUPP;
if (!blk_queue_secure_erase(q))
return -EOPNOTSUPP;
op = REQ_OP_SECURE_ERASE;
} else {
if (!blk_queue_discard(q))
return -EOPNOTSUPP;
-   if ((flags & BLKDEV_DISCARD_ZERO) &&
-   !q->limits.discard_zeroes_data)
-   return -EOPNOTSUPP;
op = REQ_OP_DISCARD;
}
 
@@ -126,7 +121,7 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
&bio);
if (!ret && bio) {
ret = submit_bio_wait(bio);
-   if (ret == -EOPNOTSUPP && !(flags & BLKDEV_DISCARD_ZERO))
+   if (ret == -EOPNOTSUPP)
ret = 0;
bio_put(bio);
}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 9d515ae3a405..fe2794986eb9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -104,7 +104,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->discard_granularity = 0;
lim->discard_alignment = 0;
lim->discard_misaligned = 0;
-   lim->discard_zeroes_data = 0;
lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
lim->align

[PATCH 17/23] rbd: remove the discard_zeroes_data flag

2017-03-23 Thread Christoph Hellwig

rbd only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rbd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 517838b65964..0ec3b430e81d 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4380,7 +4380,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
q->limits.discard_granularity = segment_size;
q->limits.discard_alignment = segment_size;
blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
-   q->limits.discard_zeroes_data = 1;
 
if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
-- 
2.11.0

[PATCH 19/23] mmc: remove the discard_zeroes_data flag

2017-03-23 Thread Christoph Hellwig

mmc only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
---
 drivers/mmc/core/queue.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 493eb10ce580..4c54ad34e17a 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -167,8 +167,6 @@ static void mmc_queue_setup_discard(struct request_queue *q,
 
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
blk_queue_max_discard_sectors(q, max_discard);
-   if (card->erased_byte == 0 && !mmc_can_discard(card))
-   q->limits.discard_zeroes_data = 1;
q->limits.discard_granularity = card->pref_erase << 9;
/* granularity must not be greater than max. discard */
if (card->pref_erase > max_discard)
-- 
2.11.0

[PATCH 01/23] block: renumber REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blk_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb55d0f..6393c13a6498 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -160,7 +160,7 @@ enum req_opf {
/* write the same sector many times */
REQ_OP_WRITE_SAME   = 7,
/* write the zero filled sector many times */
-   REQ_OP_WRITE_ZEROES = 8,
+   REQ_OP_WRITE_ZEROES = 9,
 
/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN  = 32,
-- 
2.11.0

[PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

It seems like DRBD assumes its on the wire TRIM request always zeroes data.
Use that fact to implement REQ_OP_WRITE_ZEROES.

XXX: will need a careful audit from the drbd team!

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_main.c | 3 ++-
 drivers/block/drbd/drbd_nl.c   | 2 ++
 drivers/block/drbd/drbd_receiver.c | 6 +++---
 drivers/block/drbd/drbd_req.c  | 7 +--
 drivers/block/drbd/drbd_worker.c   | 4 +++-
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..8e62d9f65510 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1668,7 +1668,8 @@ static u32 bio_flags_to_wire(struct drbd_connection 
*connection,
(bio->bi_opf & REQ_FUA ? DP_FUA : 0) |
(bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) |
(bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) |
-   (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0);
+   (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) |
+   (bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0);
else
return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0;
 }
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 908c704e20aa..e4516d3b971d 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1217,10 +1217,12 @@ static void decide_on_discard_support(struct 
drbd_device *device,
blk_queue_discard_granularity(q, 512);
q->limits.max_discard_sectors = 
drbd_max_discard_sectors(connection);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+   q->limits.max_write_zeroes_sectors = 
drbd_max_discard_sectors(connection);
} else {
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
blk_queue_discard_granularity(q, 0);
q->limits.max_discard_sectors = 0;
+   q->limits.max_write_zeroes_sectors = 0;
}
 }
 
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 8641776f043a..e64e01ca4d1a 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -2285,7 +2285,7 @@ static unsigned long wire_flags_to_bio_flags(u32 dpf)
 static unsigned long wire_flags_to_bio_op(u32 dpf)
 {
if (dpf & DP_DISCARD)
-   return REQ_OP_DISCARD;
+   return REQ_OP_WRITE_ZEROES;
else
return REQ_OP_WRITE;
 }
@@ -2476,7 +2476,7 @@ static int receive_Data(struct drbd_connection 
*connection, struct packet_info *
op_flags = wire_flags_to_bio_flags(dp_flags);
if (pi->cmd == P_TRIM) {
D_ASSERT(peer_device, peer_req->i.size > 0);
-   D_ASSERT(peer_device, op == REQ_OP_DISCARD);
+   D_ASSERT(peer_device, op == REQ_OP_WRITE_ZEROES);
D_ASSERT(peer_device, peer_req->pages == NULL);
} else if (peer_req->pages == NULL) {
D_ASSERT(device, peer_req->i.size == 0);
@@ -4789,7 +4789,7 @@ static int receive_rs_deallocated(struct drbd_connection 
*connection, struct pac
 
if (get_ldev(device)) {
struct drbd_peer_request *peer_req;
-   const int op = REQ_OP_DISCARD;
+   const int op = REQ_OP_WRITE_ZEROES;
 
peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
   size, 0, GFP_NOIO);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index bdd42b8360cc..d76ee53ce528 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -59,6 +59,7 @@ static struct drbd_request *drbd_req_new(struct drbd_device 
*device, struct bio
drbd_req_make_private_bio(req, bio_src);
req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0)
  | (bio_op(bio_src) == REQ_OP_WRITE_SAME ? RQ_WSAME : 0)
+ | (bio_op(bio_src) == REQ_OP_WRITE_ZEROES ? RQ_UNMAP : 0)
  | (bio_op(bio_src) == REQ_OP_DISCARD ? RQ_UNMAP : 0);
req->device = device;
req->master_bio = bio_src;
@@ -1180,7 +1181,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
if (get_ldev(device)) {
if (drbd_insert_fault(device, type))
bio_io_error(bio);
-   else if (bio_op(bio) == REQ_OP_DISCARD)
+   else if (bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+bio_op(bio) == REQ_OP_DISCARD)
drbd_process_discard_req(req);
else
generic_make_request(bio);
@@ -1234,7 +1236,8 @@ drbd_request_prepare(struct drbd_device *device, struct 
bio *bio, unsigned long
_drbd_start_io_acct(device, req);
 
/* process discards alwa

Re: support ranges TRIM for libata

2017-03-23 Thread James Bottomley

On Thu, 2017-03-23 at 14:55 +0100, Christoph Hellwig wrote:
> On Thu, Mar 23, 2017 at 09:47:41AM -0400, James Bottomley wrote:
> > > The current implementation already has the issue of that it does
> > > corrupt user data reliably if the using SG_IO for WRITE SAME
> > > commands.
> > 
> > That does need fixing.
> 
> I don't think it's fixable as long as we translate the data payload.
> 
> > Why can't we do what the t10 sat document recommends: if the ATA 
> > device doesn't support the XL version (32 bit ranges) then 
> > translate unmap to multiple non-XL commands?
> 
> Because libata sits underneath the tag allocator we'd getting into
> a giant set of problems.  Where do you expect the new commands to
> magically come from?  And adhere to the queueing limits and not 
> actually deadlock in one way or another?

I'm certainly not saying we blindly follow t10, but I believe their
intent is to issue the next command from the completion of the first
(we can do this using qc->complete_fn, like atapi_request_sense).  That
way we don't get any tag problems because there's only one command
outstanding at once; reusing the qc means no allocation issues either.

The t10 approach does mean the SG_IO problem is actually fixable rather
than simply erroring out.

> > I don't necessarily object to the vendor specific 1<->1 approach, 
> > it's just it won't fix the problem you cited above (SG_IO WRITE 
> > SAME), its just that now we error the command, which may cause some
> > surprise.
> 
> We now error WRITE SAME for passthrough consistently.  Before we only
> accepted it only with the unmap bit set, and even did so incorrectly
> (not checking that the payload was all zeros).

If it never worked for anyone, I'm OK with changing it to error out.

> > I also wonder if we couldn't simply do an ATA_16 TRIM if we're
> > already going to all the trouble of recognising ATA devices in the 
> > sd discard path?
> 
> We probably could do that as well.  But that would drag a lot more
> ATA-specific code into sd.c than just formatting the ranges.

That's up to you ... from the point of view of code documenting itself,
forming the ATA_16 TRIM in sd and not doing any satl transformation is
easier for others to follow, but if it's going to cause more code, I'm
only marginal on the advantages of easier to follow code.

James

[PATCH 06/23] dm-kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-03-23 Thread Christoph Hellwig

It seems like the code currently passes whatever it was using for writes
to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
need any payload.

Untested, and confused by the code, maybe someone who understands it
better than me can help..

Not-yet-signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-kcopyd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 9e9d04cb7d51..f85846741d50 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct 
dm_io_region *from,
job->pages = &zero_page_list;
 
/*
-* Use WRITE SAME to optimize zeroing if all dests support it.
+* Use WRITE ZEROES to optimize zeroing if all dests support it.
 */
-   job->rw = REQ_OP_WRITE_SAME;
+   job->rw = REQ_OP_WRITE_ZEROES;
for (i = 0; i < job->num_dests; i++)
-   if (!bdev_write_same(job->dests[i].bdev)) {
+   if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) {
job->rw = WRITE;
break;
}
-- 
2.11.0

[PATCH 02/23] block: implement splitting of REQ_OP_WRITE_ZEROES bios

2017-03-23 Thread Christoph Hellwig

Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.

Signed-off-by: Christoph Hellwig 
---
 block/blk-merge.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index c62a6f0325e0..f9e5a3e0944b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -63,6 +63,20 @@ static struct bio *blk_bio_discard_split(struct 
request_queue *q,
return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
+   struct bio *bio, struct bio_set *bs, unsigned *nsegs)
+{
+   *nsegs = 1;
+
+   if (!q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
@@ -209,8 +223,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio,
split = blk_bio_discard_split(q, *bio, bs, &nsegs);
break;
case REQ_OP_WRITE_ZEROES:
-   split = NULL;
-   nsegs = (*bio)->bi_phys_segments;
+   split = blk_bio_write_zeroes_split(q, *bio, bs, &nsegs);
break;
case REQ_OP_WRITE_SAME:
split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
-- 
2.11.0

RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-23 Thread Christoph Hellwig

This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
supported by the block layer, and switches existing implementations
of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
removes incorrect discard_zeroes_data, and also switches WRITE SAME
based zeroing in SCSI to this new method.

I've done testing with ATA, SCSI and NVMe setups, but there are
a few things that will need more attention:

 - what is dm-kcopyd doing with the current WRITE SAME usage
   that gets multiple biovecs?  Can we fix it up in any way.
 - what are we going to do with discards through parity raid?
   I suspect we should either just disable it for now entirely
   or modify raid5.c to issue REQ_OP_WRITE_ZEROES to the underlying
   devices.  But I need someone with such a setup to test it.
 - The DRBD code in this area was very odd, and will need an
   audit from the maintainers.

Note that this series needs to be applied on top of the
"support ranges TRIM for libata" series.

A git tree is also avaiable at:

git://git.infradead.org/users/hch/block.git discard-rework

Gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework

Re: [PATCH] qedf: fix wrong le16 conversion

2017-03-23 Thread Martin K. Petersen

Arnd Bergmann  writes:

> gcc points out that we are converting a 16-bit integer into a 32-bit
> little-endian type and assigning that to 16-bit little-endian
> will end up with a zero:
>
> drivers/scsi/qedf/drv_fcoe_fw_funcs.c: In function 
> 'init_initiator_rw_fcoe_task':
> include/uapi/linux/byteorder/big_endian.h:32:26: error: large integer 
> implicitly truncated to unsigned type [-Werror=overflow]
>   t_st_ctx->read_write.rx_id = cpu_to_le32(FCOE_RX_ID);
>
> The correct solution appears to be to just use a 16-bit byte swap instead.
>
> Fixes: be086e7c53f1 ("qed*: Utilize Firmware 8.15.3.0")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/scsi/qedf/drv_fcoe_fw_funcs.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Dave: Since you queued the firmware patch, mind taking this fix through
your tree?

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH] hpsa: fix volume offline state

2017-03-23 Thread Martin K. Petersen

Tomas Henzl  writes:

> In a previous patch a hpsa_scsi_dev_t.volume_offline update line
> has been removed, so let us put it back..
>
> Fixes: 85b29008d8 (hpsa: update check for logical volume status)

Applied to 4.11/scsi-fixes.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH v2 00/10] be2iscsi: driver update 11.4.0.0

2017-03-23 Thread Martin K. Petersen

Jitendra Bhivare  writes:

Somebody please review!

Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH 1/6] ipr: Fix missed EH wakeup

2017-03-23 Thread Martin K. Petersen

Brian King  writes:

> Following a command abort or device reset, ipr's EH handlers wait
> for the commands getting aborted to get sent back from the adapter
> prior to returning from the EH handler. This fixes up some cases
> where the completion handler was not getting called, which would
> have resulted in the EH thread waiting until it timed out, greatly
> extending EH time.

Somebody please review this series so I can get it queued up.

Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH 1/1] fnic: Adding Check Condition counter to misc fnicstats

2017-03-23 Thread Martin K. Petersen

"Satish Kharat (satishkh)"  writes:

Satish,

> Apologies for the delay. I was not able to verify this because of
> another fnic issue blocking this test. Just now submitted a fix for
> that 'fnic issue' (in the patch => [PATCH 1/1] fnic: bug fix for
> fip.fip_subcode in fnic_fcoe_send_vlan_req)
>
> Did some quick verification and basic IO test, this patch looks good.

OK, I have applied your patches as well as Christoph's
pci_alloc_irq_vectors() change to 4.12/scsi-queue.

Thank you!

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: support ranges TRIM for libata

2017-03-23 Thread Christoph Hellwig

On Thu, Mar 23, 2017 at 09:47:41AM -0400, James Bottomley wrote:
> > The current implementation already has the issue of that it does
> > corrupt user data reliably if the using SG_IO for WRITE SAME
> > commands.
> 
> That does need fixing.

I don't think it's fixable as long as we translate the data payload.

> Why can't we do what the t10 sat document recommends: if the ATA device
> doesn't support the XL version (32 bit ranges) then translate unmap to
> multiple non-XL commands?

Because libata sits underneath the tag allocator we'd getting into
a giant set of problems.  Where do you expect the new commands to
magically come from?  And adhere to the queueing limits and not actually
deadlock in one way or another?

> I don't necessarily object to the vendor specific 1<->1 approach, it's
> just it won't fix the problem you cited above (SG_IO WRITE SAME), its
> just that now we error the command, which may cause some surprise.

We now error WRITE SAME for passthrough consistently.  Before we only
accepted it only with the unmap bit set, and even did so incorrectly
(not checking that the payload was all zeros).

> I
> also wonder if we couldn't simply do an ATA_16 TRIM if we're already
> going to all the trouble of recognising ATA devices in the sd discard
> path?

We probably could do that as well.  But that would drag a lot more
ATA-specific code into sd.c than just formatting the ranges.

[PATCH] qedf: Fix crash due to unsolicited FIP VLAN response.

2017-03-23 Thread Dupuis, Chad

From: Chad Dupuis 

We need to initialize qedf->fipvlan_compl in __qedf_probe so that if we receive
an unsolicited FIP VLAN response, the system doesn't crash due to trying to
complete an uninitialized completion.

Also add a check to see if there are any waiters on the completion so we don't
inadvertantly kick start the discovery process due to the unsolicited frame.

Fixed the crash:

<1>BUG: unable to handle kernel NULL pointer dereference at (null)
<1>IP: [] __wake_up_common+0x31/0x90
<4>PGD 0
<4>Oops:  [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 7
<4>Modules linked in: autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc 
target_core_iblock target_core_file target_core_pscsi target_core_mod configfs 
bnx2fc cnic fcoe 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 
iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state 
nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat uinput ipmi_devintf 
microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt 
iTCO_vendor_support dcdbas sg joydev sb_edac edac_core lpc_ich mfd_core shpchp 
tg3 ptp pps_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif qedi(U) 
iscsi_boot_sysfs libiscsi scsi_transport_iscsi uio qedf(U) libfcoe libfc 
scsi_transport_fc scsi_tgt qede(U) qed(U) ahci megaraid_sas wmi dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
<4>
<4>Pid: 1485, comm: qedf_11_ll2 Not tainted 2.6.32-642.el6.x86_64 #1 Dell Inc. 
PowerEdge R730/0599V5
<4>RIP: 0010:[]  [] 
__wake_up_common+0x31/0x90
<4>RSP: 0018:881068a83d50  EFLAGS: 00010086
<4>RAX: ffe8 RBX: 88106bf42de0 RCX: 
<4>RDX:  RSI: 0003 RDI: 88106bf42de0
<4>RBP: 881068a83d90 R08:  R09: fffe
<4>R10:  R11: 000b R12: 0286
<4>R13: 88106bf42de8 R14:  R15: 
<4>FS:  () GS:88089c46() knlGS:
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
<4>CR2:  CR3: 01a8d000 CR4: 001407e0
<4>DR0:  DR1:  DR2: 
<4>DR3:  DR6: 0ff0 DR7: 0400
<4>Process qedf_11_ll2 (pid: 1485, threadinfo 881068a8, task 
881068a70040)
<4>Stack:
<4> 88106ef00090 00030001 881068a83d90 88106bf42de0
<4> 0286 88106bf42dd8 88106bf40a50 0002
<4> 881068a83dc0 810634c7 8813 000b
<4>Call Trace:
<4> [] complete+0x47/0x60
<4> [] qedf_fip_recv+0x1c7/0x450 [qedf]
<4> [] qedf_ll2_recv_thread+0x33b/0x510 [qedf]
<4> [] ? qedf_ll2_recv_thread+0x0/0x510 [qedf]
<4> [] kthread+0x9e/0xc0
<4> [] child_rip+0xa/0x20
<4> [] ? kthread+0x0/0xc0
<4> [] ? child_rip+0x0/0x20
<4>Code: 41 56 41 55 41 54 53 48 83 ec 18 0f 1f 44 00 00 89 75 cc 89 55 c8 4c 
8d 6f 08 48 8b 57 08 41 89 cf 4d 89 c6 48 8d 42 e8 49 39 d5 <48> 8b 58 18 74 3f 
48 83 eb 18 eb 0a 0f 1f 00 48 89 d8 48 8d 5a
<1>RIP  [] __wake_up_common+0x31/0x90
<4> RSP 
<4>CR2: 

Signed-off-by: Chad Dupuis 
---
 drivers/scsi/qedf/qedf_fip.c  | 3 ++-
 drivers/scsi/qedf/qedf_main.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/qedf/qedf_fip.c b/drivers/scsi/qedf/qedf_fip.c
index ed58b91..e10b91c 100644
--- a/drivers/scsi/qedf/qedf_fip.c
+++ b/drivers/scsi/qedf/qedf_fip.c
@@ -99,7 +99,8 @@ static void qedf_fcoe_process_vlan_resp(struct qedf_ctx *qedf,
qedf_set_vlan_id(qedf, vid);
 
/* Inform waiter that it's ok to call fcoe_ctlr_link up() */
-   complete(&qedf->fipvlan_compl);
+   if (!completion_done(&qedf->fipvlan_compl))
+   complete(&qedf->fipvlan_compl);
}
 }
 
diff --git a/drivers/scsi/qedf/qedf_main.c b/drivers/scsi/qedf/qedf_main.c
index 8e2a160..cceddd9 100644
--- a/drivers/scsi/qedf/qedf_main.c
+++ b/drivers/scsi/qedf/qedf_main.c
@@ -2803,6 +2803,7 @@ static int __qedf_probe(struct pci_dev *pdev, int mode)
atomic_set(&qedf->num_offloads, 0);
qedf->stop_io_on_error = false;
pci_set_drvdata(pdev, qedf);
+   init_completion(&qedf->fipvlan_compl);
 
QEDF_INFO(&(qedf->dbg_ctx), QEDF_LOG_INFO,
   "QLogic FastLinQ FCoE Module qedf %s, "
-- 
1.8.5.6

Re: support ranges TRIM for libata

2017-03-23 Thread James Bottomley

On Wed, 2017-03-22 at 19:19 +0100, Christoph Hellwig wrote:
> On Tue, Mar 21, 2017 at 02:59:01PM -0400, Tejun Heo wrote:
> > I do like the fact that this is a lot simpler than the previous
> > implementation but am not quite sure we want to deviate 
> > significantly from what we do for other commands (command 
> > translation).  Is it because fixing the existing implementation 
> > would involve invaisve changes including memory allocations?
> 
> The current implementation already has the issue of that it does
> corrupt user data reliably if the using SG_IO for WRITE SAME
> commands.

That does need fixing.

> Doing ranges using translation would turn into a nightmare because
> ATA TRIM ranges are 16 bits long while SCSI UNAMP ranges are 32-bit,
> so we effectively can't translated them without introducing a
> non-standard hook between libata and scsi to communicate that
> limit.

Why can't we do what the t10 sat document recommends: if the ATA device
doesn't support the XL version (32 bit ranges) then translate unmap to
multiple non-XL commands?

I don't necessarily object to the vendor specific 1<->1 approach, it's
just it won't fix the problem you cited above (SG_IO WRITE SAME), its
just that now we error the command, which may cause some surprise.  I
also wonder if we couldn't simply do an ATA_16 TRIM if we're already
going to all the trouble of recognising ATA devices in the sd discard
path?

James

>   And once we're down that path we might as well just do the
> right thing directly.

[PATCH] qla2xxx: use sg_virt()

2017-03-23 Thread Geliang Tang

Use sg_virt() instead of open-coding it.

Signed-off-by: Geliang Tang 
---
 drivers/scsi/qla2xxx/qla_isr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367..9610d85 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1991,7 +1991,7 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
return 1;
}
 
-   spt = page_address(sg_page(sg)) + sg->offset;
+   spt = sg_virt(sg);
spt += j;
 
spt->app_tag = 0x;
-- 
2.9.3

Re: [PATCHv3 4/6] scsi_error: do not escalate failed EH command

2017-03-23 Thread Hannes Reinecke

On 03/21/2017 08:05 PM, Benjamin Block wrote:
> On Thu, Mar 16, 2017 at 12:53:45PM +0100, Hannes Reinecke wrote:
>> On 03/16/2017 12:01 PM, Benjamin Block wrote:
>>> On Wed, Mar 15, 2017 at 02:54:16PM +0100, Hannes Reinecke wrote:
 On 03/14/2017 06:56 PM, Benjamin Block wrote:
> Hello Hannes,
>
> On Wed, Mar 01, 2017 at 10:15:18AM +0100, Hannes Reinecke wrote:
>> When a command is sent as part of the error handling there
>> is not point whatsoever to start EH escalation when that
>> command fails; we are _already_ in the error handler,
>> and the escalation is about to commence anyway.
>> So just call 'scsi_try_to_abort_cmd()' to abort outstanding
>> commands and let the main EH routine handle the rest.
>>
>> Signed-off-by: Hannes Reinecke 
>> Reviewed-by: Johannes Thumshirn 
>> Reviewed-by: Bart Van Assche 
>> ---
>>  drivers/scsi/scsi_error.c | 11 +--
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index e1ca3b8..4613aa1 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -889,15 +889,6 @@ static int scsi_try_to_abort_cmd(struct 
>> scsi_host_template *hostt,
>>  return hostt->eh_abort_handler(scmd);
>>  }
>>
>> -static void scsi_abort_eh_cmnd(struct scsi_cmnd *scmd)
>> -{
>> -if (scsi_try_to_abort_cmd(scmd->device->host->hostt, scmd) != 
>> SUCCESS)
>> -if (scsi_try_bus_device_reset(scmd) != SUCCESS)
>> -if (scsi_try_target_reset(scmd) != SUCCESS)
>> -if (scsi_try_bus_reset(scmd) != SUCCESS)
>> -scsi_try_host_reset(scmd);
>> -}
>> -
>>  /**
>>   * scsi_eh_prep_cmnd  - Save a scsi command info as part of error 
>> recovery
>>   * @scmd:   SCSI command structure to hijack
>> @@ -1082,7 +1073,7 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd 
>> *scmd, unsigned char *cmnd,
>>  break;
>>  }
>>  } else if (rtn != FAILED) {
>> -scsi_abort_eh_cmnd(scmd);
>> +scsi_try_to_abort_cmd(shost->hostt, scmd);
>>  rtn = FAILED;
>>  }
>
> The idea is sound, but this implementation would cause "use-after-free"s.
>
> I only know our own LLD well enough to judge, but with zFCP there will
> always be a chance that an abort fails - be it memory pressure,
> hardware/firmware behavior or internal EH in zFCP.
>
> Calling queuecommand() will mean for us in the LLD, that we allocate a
> unique internal request struct for the scsi_cmnd (struct
> zfcp_fsf_request) and add that to our internal hash-table with
> outstanding commands. We assume this scsi_cmnd-pointer is ours till we
> complete it via scsi_done are yield it via successful EH-actions.
>
> In case the abort fails, you fail to take back the ownership over the
> scsi command. Which in turn means possible "use-after-free"s when we
> still thinks the scsi command is ours, but EH has already overwritten
> the scsi-command with the original one. When we still get an answer or
> otherwise use the scsi_cmnd-pointer we would access an invalid one.
>
 That is actually not try.
 As soon as we're calling 'scsi_try_to_abort_command()' ownership is
 assumed to reside in the SCSI midlayer;
>>>
>>> That can not be true. First of all, look at the function itself (v4.10):
>>>
>>> static int scsi_try_to_abort_cmd...
>>> {
>>> if (!hostt->eh_abort_handler)
>>> return FAILED;
>>>
>>> return hostt->eh_abort_handler(scmd);
>>> }
>>>
>>> If what you say is true, then this whole API of LLDs providing or
>>> choosing not to provide implementations for these function would be
>>> fundamentally broken.
>>> The function itself returns FAILED when there is no such function.. how
>>> is a LLD that does not implement it ever to know that you took ownership
>>> by calling scsi_try_to_abort_cmd()?
>>>
>> Well. Ok.
>>
>> _Actually_, the whole concept of 'ownership' in SCSI EH is a bit flaky.
>>
>> There are two ways of entering the error handler:
>> - SCSI command timeout
>> - Failing to evaluate the SCSI command status
>>
>> For the latter case ownership already _is_ with the SCSI midlayer, as
>> the LLDD called 'scsi_done' and with that moved ownership to the midlayer.
>>
>> The interesting part is command timeout.
>> Once a command timeout triggers the block layer is calling
>> 'blk_mark_rq_complete' to avoid any concurrent completions.
>> IE any calls to scsi_done() will be short-circuited with that,
>> effectively transferring ownership to SCSI midlayer.
>>
>> Now the SCSI midlayer has to inform the LLDD that it ha

[PATCH] qla2xxx: remove redundant check on sess being non-null

2017-03-23 Thread Colin King

From: Colin Ian King 

All error paths to label out_term2 result in sess being null, so the
check for sess being non-null and the call to put_sess is dead code
and can therefore be removed.

Detected with CoverityScan, CID#1420664 ("Logically dead code")

Signed-off-by: Colin Ian King 
---
 drivers/scsi/qla2xxx/qla_target.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_target.c 
b/drivers/scsi/qla2xxx/qla_target.c
index 0e03ca2ab3e5..32e07f21ddb2 100644
--- a/drivers/scsi/qla2xxx/qla_target.c
+++ b/drivers/scsi/qla2xxx/qla_target.c
@@ -5853,8 +5853,6 @@ static void qlt_abort_work(struct qla_tgt *tgt,
return;
 
 out_term2:
-   if (sess)
-   ha->tgt.tgt_ops->put_sess(sess);
spin_unlock_irqrestore(&ha->tgt.sess_lock, flags2);
 
 out_term:
-- 
2.11.0

Re: [PATCH] scsi: fcoe: sanity check string size for store_ctrl_mode option

2017-03-23 Thread Dan Carpenter

On Wed, Mar 22, 2017 at 07:42:08PM +, Colin Ian King wrote:
> On 22/03/17 19:39, Dan Carpenter wrote:
> > On Wed, Mar 22, 2017 at 02:01:37PM +, Colin King wrote:
> >> From: Colin Ian King 
> >>
> >> Reading and writing to mode[count - 1] implies the count should not
> >> be less than 1 so add a sanity check for this.
> >>
> >> Detected with CoverityScan, CID#1357345 ("Overflowed array index write")
> >>
> >> Signed-off-by: Colin Ian King 
> > 
> > This is harmless, of course, but count can't be zero.  This is a sysfs
> > file so we test for zero size writes in sysfs_kf_write() and return
> > early.
> 
> Ah, thanks for pointing out that. I overlooked that detail.
> 

The only reason I know this stuff is because it's annotated in Smatch.
So I do this:

diff --git a/drivers/scsi/fcoe/fcoe_sysfs.c b/drivers/scsi/fcoe/fcoe_sysfs.c
index 9cf3d56296ab..c491ad8fb0a8 100644
--- a/drivers/scsi/fcoe/fcoe_sysfs.c
+++ b/drivers/scsi/fcoe/fcoe_sysfs.c
@@ -281,6 +281,7 @@ static ssize_t show_ctlr_mode(struct device *dev,
"%s\n", name);
 }
 
+#include "/home/dcarpenter/progs/smatch/devel/check_debug.h"
 static ssize_t store_ctlr_mode(struct device *dev,
   struct device_attribute *attr,
   const char *buf, size_t count)
@@ -288,6 +289,7 @@ static ssize_t store_ctlr_mode(struct device *dev,
struct fcoe_ctlr_device *ctlr = dev_to_ctlr(dev);
char mode[FCOE_MAX_MODENAME_LEN + 1];
 
+   __smatch_implied(count);
if (count > FCOE_MAX_MODENAME_LEN)
return -EINVAL;
 

Then when I run kchecker drivers/scsi/fcoe/fcoe_sysfs.c, it tells me:

drivers/scsi/fcoe/fcoe_sysfs.c:292 store_ctlr_mode() implied: count = 
'1-10,2147479552'

Which is sort of surprising...  The 10 value is a hack I made so
that it would never complain that "off + count" will wrap.  But
apparently something has changed so it's also picking up the true limit
of count which is 2147479552.

Then I run the following commands to view the call tree:
smdb.py store_ctlr_mode
smdb.py dev_attr_store
smdb.py sysfs_kf_write
I have some vim macros so I can look these up really quickly.

regards,
dan carpenter

[PATCH] scsi: ufs: fix wrong/ambiguous fall through comments

2017-03-23 Thread kusumi . tomohiro

From: Tomohiro Kusumi 

These aren't really falling through to anywhere meaningful.

Signed-off-by: Tomohiro Kusumi 
---
 drivers/scsi/ufs/ufshcd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index dc6efbd..b7e5128 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -553,15 +553,14 @@ static inline u32 ufshcd_get_intr_mask(struct ufs_hba 
*hba)
case UFSHCI_VERSION_10:
intr_mask = INTERRUPT_MASK_ALL_VER_10;
break;
-   /* allow fall through */
case UFSHCI_VERSION_11:
case UFSHCI_VERSION_20:
intr_mask = INTERRUPT_MASK_ALL_VER_11;
break;
-   /* allow fall through */
case UFSHCI_VERSION_21:
default:
intr_mask = INTERRUPT_MASK_ALL_VER_21;
+   break;
}
 
return intr_mask;
-- 
2.9.3

[PATCH] scsi: osd_uld: remove an unneeded NULL check

2017-03-23 Thread Dan Carpenter

We don't call the remove() function unless probe() succeeds so "oud"
can't be NULL here.  Plus, if it were NULL, we dereference it on the
next line so it would crash anyway.

Signed-off-by: Dan Carpenter 

diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c
index 4101c3178411..8b9941a5687a 100644
--- a/drivers/scsi/osd/osd_uld.c
+++ b/drivers/scsi/osd/osd_uld.c
@@ -507,10 +507,9 @@ static int osd_remove(struct device *dev)
struct scsi_device *scsi_device = to_scsi_device(dev);
struct osd_uld_device *oud = dev_get_drvdata(dev);
 
-   if (!oud || (oud->od.scsi_device != scsi_device)) {
-   OSD_ERR("Half cooked osd-device %p,%p || %p!=%p",
-   dev, oud, oud ? oud->od.scsi_device : NULL,
-   scsi_device);
+   if (oud->od.scsi_device != scsi_device) {
+   OSD_ERR("Half cooked osd-device %p, || %p!=%p",
+   dev, oud->od.scsi_device, scsi_device);
}
 
cdev_device_del(&oud->cdev, &oud->class_dev);

Re: [PATCH] scsi: pm8001: build in relevant functions and code on PM8001_USE_MSIX

2017-03-23 Thread Jinpu Wang

On Wed, Mar 22, 2017 at 7:50 PM, Colin King  wrote:
> From: Colin Ian King 
>
> Currently the misx and intx variables of the interrupt enable/disable
> helper functions are built in no matter what the setting of the
> macro PM8001_USE_MSIX.  Clean this up by just building in the
> necessary helper functions and calls to these functions depending on
> the setting of PM8001_USE_MSIX.  This addresses several dead code
> paths found by static analysis with CoverityScan.
>
> Signed-off-by: Colin Ian King 
> ---
>  drivers/scsi/pm8001/pm8001_hwi.c | 63 
> +---
>  1 file changed, 33 insertions(+), 30 deletions(-)

Thanks Colin,

Mind to also change pm80xx_hw.c?

-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

Re: [PATCH] scsi: pm8001: build in relevant functions and code on PM8001_USE_MSIX

2017-03-23 Thread Jinpu Wang

On Thu, Mar 23, 2017 at 9:34 AM, walter harms  wrote:

>> @@ -4613,15 +4616,15 @@ static int pm8001_chip_phy_ctl_req(struct 
>> pm8001_hba_info *pm8001_ha,
>>
>>  static u32 pm8001_chip_is_our_interupt(struct pm8001_hba_info *pm8001_ha)
>>  {
>> - u32 value;
>>  #ifdef PM8001_USE_MSIX
>>   return 1;
>> -#endif
>> - value = pm8001_cr32(pm8001_ha, 0, MSGU_ODR);
>> +#else
>> + u32 value = pm8001_cr32(pm8001_ha, 0, MSGU_ODR);
>> +
>>   if (value)
>>   return 1;
>>   return 0;
>> -
>> +#endif
>>  }
>>
>
> This is a bit strange, why do this function return u32 ?
>
> re,
>  wh
>
>>  /**

Hi Walter,

Feel free to submit a patch to change it to bool.

Thanks

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

Re: [PATCH] scsi: pm8001: build in relevant functions and code on PM8001_USE_MSIX

2017-03-23 Thread walter harms



Am 22.03.2017 19:50, schrieb Colin King:
> From: Colin Ian King 
> 
> Currently the misx and intx variables of the interrupt enable/disable
> helper functions are built in no matter what the setting of the
> macro PM8001_USE_MSIX.  Clean this up by just building in the
> necessary helper functions and calls to these functions depending on
> the setting of PM8001_USE_MSIX.  This addresses several dead code
> paths found by static analysis with CoverityScan.
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/scsi/pm8001/pm8001_hwi.c | 63 
> +---
>  1 file changed, 33 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/scsi/pm8001/pm8001_hwi.c 
> b/drivers/scsi/pm8001/pm8001_hwi.c
> index 10546faac58c..d1be10fd1350 100644
> --- a/drivers/scsi/pm8001/pm8001_hwi.c
> +++ b/drivers/scsi/pm8001/pm8001_hwi.c
> @@ -1204,26 +1204,7 @@ void pm8001_chip_iounmap(struct pm8001_hba_info 
> *pm8001_ha)
>   }
>  }
>  
> -/**
> - * pm8001_chip_interrupt_enable - enable PM8001 chip interrupt
> - * @pm8001_ha: our hba card information
> - */
> -static void
> -pm8001_chip_intx_interrupt_enable(struct pm8001_hba_info *pm8001_ha)
> -{
> - pm8001_cw32(pm8001_ha, 0, MSGU_ODMR, ODMR_CLEAR_ALL);
> - pm8001_cw32(pm8001_ha, 0, MSGU_ODCR, ODCR_CLEAR_ALL);
> -}
> -
> - /**
> -  * pm8001_chip_intx_interrupt_disable- disable PM8001 chip interrupt
> -  * @pm8001_ha: our hba card information
> -  */
> -static void
> -pm8001_chip_intx_interrupt_disable(struct pm8001_hba_info *pm8001_ha)
> -{
> - pm8001_cw32(pm8001_ha, 0, MSGU_ODMR, ODMR_MASK_ALL);
> -}
> +#ifdef PM8001_USE_MSIX
>  
>  /**
>   * pm8001_chip_msix_interrupt_enable - enable PM8001 chip interrupt
> @@ -1257,6 +1238,30 @@ pm8001_chip_msix_interrupt_disable(struct 
> pm8001_hba_info *pm8001_ha,
>   pm8001_cw32(pm8001_ha, 0,  msi_index, MSIX_INTERRUPT_DISABLE);
>  }
>  
> +#else
> +
> +/**
> + * pm8001_chip_interrupt_enable - enable PM8001 chip interrupt
> + * @pm8001_ha: our hba card information
> + */
> +static void
> +pm8001_chip_intx_interrupt_enable(struct pm8001_hba_info *pm8001_ha)
> +{
> + pm8001_cw32(pm8001_ha, 0, MSGU_ODMR, ODMR_CLEAR_ALL);
> + pm8001_cw32(pm8001_ha, 0, MSGU_ODCR, ODCR_CLEAR_ALL);
> +}
> +
> + /**
> +  * pm8001_chip_intx_interrupt_disable- disable PM8001 chip interrupt
> +  * @pm8001_ha: our hba card information
> +  */
> +static void
> +pm8001_chip_intx_interrupt_disable(struct pm8001_hba_info *pm8001_ha)
> +{
> + pm8001_cw32(pm8001_ha, 0, MSGU_ODMR, ODMR_MASK_ALL);
> +}
> +#endif
> +
>  /**
>   * pm8001_chip_interrupt_enable - enable PM8001 chip interrupt
>   * @pm8001_ha: our hba card information
> @@ -1266,10 +1271,9 @@ pm8001_chip_interrupt_enable(struct pm8001_hba_info 
> *pm8001_ha, u8 vec)
>  {
>  #ifdef PM8001_USE_MSIX
>   pm8001_chip_msix_interrupt_enable(pm8001_ha, 0);
> - return;
> -#endif
> +#else
>   pm8001_chip_intx_interrupt_enable(pm8001_ha);
> -
> +#endif
>  }
>  
>  /**
> @@ -1281,10 +1285,9 @@ pm8001_chip_interrupt_disable(struct pm8001_hba_info 
> *pm8001_ha, u8 vec)
>  {
>  #ifdef PM8001_USE_MSIX
>   pm8001_chip_msix_interrupt_disable(pm8001_ha, 0);
> - return;
> -#endif
> +#else
>   pm8001_chip_intx_interrupt_disable(pm8001_ha);
> -
> +#endif
>  }
>  
>  /**
> @@ -4613,15 +4616,15 @@ static int pm8001_chip_phy_ctl_req(struct 
> pm8001_hba_info *pm8001_ha,
>  
>  static u32 pm8001_chip_is_our_interupt(struct pm8001_hba_info *pm8001_ha)
>  {
> - u32 value;
>  #ifdef PM8001_USE_MSIX
>   return 1;
> -#endif
> - value = pm8001_cr32(pm8001_ha, 0, MSGU_ODR);
> +#else
> + u32 value = pm8001_cr32(pm8001_ha, 0, MSGU_ODR);
> +
>   if (value)
>   return 1;
>   return 0;
> -
> +#endif
>  }
>  

This is a bit strange, why do this function return u32 ?

re,
 wh

>  /**

70 matches

Mail list logo