RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-13 Thread Long Li
> Subject: RE: [Resend Patch 3/3] Storvsc: Select channel based on available
> percentage of ring buffer to write
> 
> > -Original Message-
> > From: linux-kernel-ow...@vger.kernel.org
> >  On Behalf Of Long Li
> > Sent: Tuesday, March 27, 2018 5:49 PM
> > To: KY Srinivasan ; Haiyang Zhang
> > ; Stephen Hemminger
> ;
> > James E . J . Bottomley ; Martin K . Petersen
> > ; de...@linuxdriverproject.org; linux-
> > s...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > net...@vger.kernel.org
> > Cc: Long Li 
> > Subject: [Resend Patch 3/3] Storvsc: Select channel based on available
> > percentage of ring buffer to write
> >
> > From: Long Li 
> >
> > This is a best effort for estimating on how busy the ring buffer is
> > for that channel, based on available buffer to write in percentage. It
> > is still possible that at the time of actual ring buffer write, the
> > space may not be available due to other processes may be writing at the
> time.
> >
> > Selecting a channel based on how full it is can reduce the possibility
> > that a ring buffer write will fail, and avoid the situation a channel
> > is over busy.
> >
> > Now it's possible that storvsc can use a smaller ring buffer size
> > (e.g. 40k bytes) to take advantage of cache locality.
> >
> > Signed-off-by: Long Li 
> > ---
> >  drivers/scsi/storvsc_drv.c | 62
> > +-
> >  1 file changed, 50 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index a2ec0bc9e9fa..b1a87072b3ab 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size,
> "Ring
> > buffer size (bytes)");
> >
> >  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
> > MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs
> to
> > subchannels");
> > +
> > +static int ring_avail_percent_lowater = 10;
> 
> Reserving 10% of each ring buffer by default seems like more than is needed
> in the storvsc driver.  That would be about 4Kbytes for the 40K ring buffer
> you suggest, and even more for a ring buffer of 128K.  Each outgoing record is
> only about 344 bytes (I'd have to check exactly).  With the new channel
> selection algorithm below, the only time we use a channel that is already
> below the low water mark is when no channel could be found that is above
> the low water mark.   There could be a case of two or more threads deciding
> that a channel is above the low water mark at the same time and both
> choosing it, but that's likely to be rare.  So it seems like we could set the

It's not rare for two processes checking on the same channel at the same time, 
when running multiple processes I/O workload. The CPU to channel is not 1:1 
mapping.

> default low water mark to 5 percent or even 3 percent, which will let more of
> the ring buffer be used, and let a channel be assigned according to the
> algorithm, rather than falling through to the default because all channels
> appear to be "full".

It seems it's not about how big ring buffer is, e.g. even you have a ring 
buffer of infinite size, it won't help with performance if it's getting queued 
all the time, while other ring buffers are near empty. It's more about how 
multiple ring buffers are getting utilized in a reasonable and balanced way. 
Testing shows 10 is a good choice, while 3 is prone to return BUSY and trigger 
block layer retry.

> 
> > +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> > +MODULE_PARM_DESC(ring_avail_percent_lowater,
> > +   "Select a channel if available ring size > this in percent");
> > +
> >  /*
> >   * Timeout in seconds for all devices managed by this driver.
> >   */
> > @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device
> > *device,  {
> > struct storvsc_device *stor_device;
> > struct vstor_packet *vstor_packet;
> > -   struct vmbus_channel *outgoing_channel;
> > +   struct vmbus_channel *outgoing_channel, *channel;
> > int ret = 0;
> > -   struct cpumask alloced_mask;
> > +   struct cpumask alloced_mask, other_numa_mask;
> > int tgt_cpu;
> >
> > vstor_packet = >vstor_packet; @@ -1301,22 +1307,53 @@
> > static int storvsc_do_io(struct hv_device *device,
> > /*
> >  * Select an an appropriate channel to send the request out.
> >  */
> > -
> > if (stor_device->stor_chns[q_num] != NULL) {
> > outgoing_channel = stor_device->stor_chns[q_num];
> > -   if (outgoing_channel->target_cpu == smp_processor_id()) {
> > +   if (outgoing_channel->target_cpu == q_num) {
> > /*
> >  * Ideally, we want to pick a different channel if
> >  

Re: [PATCHv2] tcmu: allow userspace to reset netlink

2018-04-13 Thread Xiubo Li

On 2018/4/14 1:21, Mike Christie wrote:

On 04/12/2018 10:08 PM, Xiubo Li wrote:

+
+if (val != 1) {
+pr_err("Invalid block value %d\n", val);

I think you wanted

"Invalid reset value %d\n"

Yeah, just copied it from other place.

+return -EINVAL;
+}
+
+spin_unlock(>nl_cmd_lock);

Need spin_lock() instead of unlock.


I think before calling the code below you need to check if a nl command
is even waiting. If you just run this with no nl commands waiting then
next time we send a command nl_cmd->complete will be true and
tcmu_wait_genl_cmd_reply's wait_event call will return right away.

Will fix this.


+nl_cmd->complete = true;
+nl_cmd->status = -EINTR;
+wake_up(>complete_wq);

Instead of the code above, I think you need to take some of the guts out
of tcmu_genl_cmd_done to handle removal which is an odd cases due to the
refcoutning/locking and configfs use.

For non removal commands we need to do a target_undepend_item. For
remove, we don't want to since we came through configfs and teardown
already started.

So instead of the above lines take:

  if (nl_cmd->cmd != completed_cmd) {
  printk(KERN_ERR "Mismatched commands (Expecting reply
for %d. Current %d).\n",
 completed_cmd, nl_cmd->cmd);
  ret = -EINVAL;
  } else {
  nl_cmd->status = rc;
  }

 // Note: I changed this from the is_removed
  if (completed_cmd != TCMU_CMD_REMOVED_DEVICE)
   target_undepend_item(>dev_group.cg_item);
  if (!ret) {
  nl_cmd->complete = true;
  wake_up(>complete_wq);
  }


Sorry, this part I couldn't totally understand.

Do you mean take the code above you pasted by introducing
target_undepend_item() for none removal paths just like in
tcmu_genl_cmd_done() does?
While in tcmu_genl_cmd_done(), the target_depend_item() is called by
target_find_device() and then it should do target_undepend_item() after
that, but in
reset_netlink_store(), should it need target_undepend_item() ? If so
they won't be in pairs.


Yes, you are right. I thought we took a count when we sent the nl cmd.


I think we should only take this part:

 if (nl_cmd->cmd != completed_cmd) {
 printk(KERN_ERR "Mismatched commands (Expecting reply
for %d. Current %d).\n",
completed_cmd, nl_cmd->cmd);
 ret = -EINVAL;
 } else {
 nl_cmd->status = rc;
 }

from the tcmu_genl_cmd_done(), right ?


We want the complete and wake up part too right?

Yeah, I will fix them all you mentioned above.

Thanks,
BRs



  nl_cmd->complete = true;
  wake_up(>complete_wq);



Thanks,
BRs



from tcmu_genl_cmd_done and make a function that takes the status code
and completed_cmd. You would just then do

__tcmu_genl_cmd_done(>curr_nl_cmd.cmd, -EINTR);





+spin_unlock(>nl_cmd_lock);
+
+return count;
+}
+CONFIGFS_ATTR_WO(tcmu_, reset_netlink);
+
   static ssize_t tcmu_reset_ring_store(struct config_item *item,
const char *page,
size_t count)
   {
@@ -2363,6 +2400,7 @@ static ssize_t tcmu_reset_ring_store(struct
config_item *item, const char *page,
   static struct configfs_attribute *tcmu_action_attrs[] = {
   _attr_block_dev,
   _attr_reset_ring,
+_attr_reset_netlink,
   NULL,
   };
  

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





RE: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring buffer percentage

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-ker...@vger.kernel.org; net...@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring 
> buffer
> percentage
> 
> From: Long Li 
> 
> In Vmbus, we have defined a function to calculate available ring buffer
> percentage to write.
> 
> Use that function and remove netvsc's private version.
> 
> Signed-off-by: Long Li 

Reviewed-by:  Michael Kelley 

> ---
>  drivers/net/hyperv/hyperv_net.h |  1 -
>  drivers/net/hyperv/netvsc.c | 17 +++--
>  drivers/net/hyperv/netvsc_drv.c |  3 ---
>  3 files changed, 3 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index cd538d5a7986..a0199ab13d67 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -189,7 +189,6 @@ struct netvsc_device;
>  struct net_device_context;
> 
>  extern u32 netvsc_ring_bytes;
> -extern struct reciprocal_value netvsc_ring_reciprocal;
> 
>  struct netvsc_device *netvsc_device_add(struct hv_device *device,
>   const struct netvsc_device_info *info);
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 0265d703eb03..8af0069e4d8c 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -31,7 +31,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
> 
> @@ -590,17 +589,6 @@ void netvsc_device_remove(struct hv_device *device)
>  #define RING_AVAIL_PERCENT_HIWATER 20
>  #define RING_AVAIL_PERCENT_LOWATER 10
> 
> -/*
> - * Get the percentage of available bytes to write in the ring.
> - * The return value is in range from 0 to 100.
> - */
> -static u32 hv_ringbuf_avail_percent(const struct hv_ring_buffer_info 
> *ring_info)
> -{
> - u32 avail_write = hv_get_bytes_to_write(ring_info);
> -
> - return reciprocal_divide(avail_write  * 100, netvsc_ring_reciprocal);
> -}
> -
>  static inline void netvsc_free_send_slot(struct netvsc_device *net_device,
>u32 index)
>  {
> @@ -649,7 +637,8 @@ static void netvsc_send_tx_complete(struct netvsc_device
> *net_device,
>   wake_up(_device->wait_drain);
> 
>   if (netif_tx_queue_stopped(netdev_get_tx_queue(ndev, q_idx)) &&
> - (hv_ringbuf_avail_percent(>outbound) >
> RING_AVAIL_PERCENT_HIWATER ||
> + (hv_get_avail_to_write_percent(>outbound) >
> +  RING_AVAIL_PERCENT_HIWATER ||
>queue_sends < 1)) {
>   netif_tx_wake_queue(netdev_get_tx_queue(ndev, q_idx));
>   ndev_ctx->eth_stats.wake_queue++;
> @@ -757,7 +746,7 @@ static inline int netvsc_send_pkt(
>   struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
>   u64 req_id;
>   int ret;
> - u32 ring_avail = hv_ringbuf_avail_percent(_channel->outbound);
> + u32 ring_avail = hv_get_avail_to_write_percent(_channel->outbound);
> 
>   nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
>   if (skb)
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index faea0be18924..b0b1c2fd2b7b 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -35,7 +35,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
>  #include 
> @@ -55,7 +54,6 @@ static unsigned int ring_size __ro_after_init = 128;
>  module_param(ring_size, uint, S_IRUGO);
>  MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)");
>  unsigned int netvsc_ring_bytes __ro_after_init;
> -struct reciprocal_value netvsc_ring_reciprocal __ro_after_init;
> 
>  static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE |
>   NETIF_MSG_LINK | NETIF_MSG_IFUP |
> @@ -2186,7 +2184,6 @@ static int __init netvsc_drv_init(void)
>   ring_size);
>   }
>   netvsc_ring_bytes = ring_size * PAGE_SIZE;
> - netvsc_ring_reciprocal = reciprocal_value(netvsc_ring_bytes);
> 
>   ret = vmbus_driver_register(_drv);
>   if (ret)
> --
> 2.14.1



RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-ker...@vger.kernel.org; net...@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 3/3] Storvsc: Select channel based on available 
> percentage of ring
> buffer to write
> 
> From: Long Li 
> 
> This is a best effort for estimating on how busy the ring buffer is for
> that channel, based on available buffer to write in percentage. It is still
> possible that at the time of actual ring buffer write, the space may not be
> available due to other processes may be writing at the time.
> 
> Selecting a channel based on how full it is can reduce the possibility that
> a ring buffer write will fail, and avoid the situation a channel is over
> busy.
> 
> Now it's possible that storvsc can use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
> 
> Signed-off-by: Long Li 
> ---
>  drivers/scsi/storvsc_drv.c | 62 
> +-
>  1 file changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..b1a87072b3ab 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer 
> size
> (bytes)");
> 
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to 
> subchannels");
> +
> +static int ring_avail_percent_lowater = 10;

Reserving 10% of each ring buffer by default seems like more than is needed
in the storvsc driver.  That would be about 4Kbytes for the 40K ring buffer
you suggest, and even more for a ring buffer of 128K.  Each outgoing record
is only about 344 bytes (I'd have to check exactly).  With the new channel
selection algorithm below, the only time we use a channel that is already
below the low water mark is when no channel could be found that is above
the low water mark.   There could be a case of two or more threads deciding
that a channel is above the low water mark at the same time and both
choosing it, but that's likely to be rare.  So it seems like we could set the
default low water mark to 5 percent or even 3 percent, which will let more
of the ring buffer be used, and let a channel be assigned according to the
algorithm, rather than falling through to the default because all channels
appear to be "full".

> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> + "Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device *device,
>  {
>   struct storvsc_device *stor_device;
>   struct vstor_packet *vstor_packet;
> - struct vmbus_channel *outgoing_channel;
> + struct vmbus_channel *outgoing_channel, *channel;
>   int ret = 0;
> - struct cpumask alloced_mask;
> + struct cpumask alloced_mask, other_numa_mask;
>   int tgt_cpu;
> 
>   vstor_packet = >vstor_packet;
> @@ -1301,22 +1307,53 @@ static int storvsc_do_io(struct hv_device *device,
>   /*
>* Select an an appropriate channel to send the request out.
>*/
> -
>   if (stor_device->stor_chns[q_num] != NULL) {
>   outgoing_channel = stor_device->stor_chns[q_num];
> - if (outgoing_channel->target_cpu == smp_processor_id()) {
> + if (outgoing_channel->target_cpu == q_num) {
>   /*
>* Ideally, we want to pick a different channel if
>* available on the same NUMA node.
>*/
>   cpumask_and(_mask, _device->alloced_cpus,
>   cpumask_of_node(cpu_to_node(q_num)));
> - for_each_cpu_wrap(tgt_cpu, _mask,
> - outgoing_channel->target_cpu + 1) {
> - if (tgt_cpu != outgoing_channel->target_cpu) {
> - outgoing_channel =
> - stor_device->stor_chns[tgt_cpu];
> - break;
> +
> + for_each_cpu_wrap(tgt_cpu, _mask, q_num + 1) {
> + if (tgt_cpu == q_num)
> + continue;
> + channel = 

Re: [PATCHv2] tcmu: allow userspace to reset netlink

2018-04-13 Thread Mike Christie
On 04/12/2018 10:08 PM, Xiubo Li wrote:
> 
>>> +
>>> +if (val != 1) {
>>> +pr_err("Invalid block value %d\n", val);
>> I think you wanted
>>
>> "Invalid reset value %d\n"
> Yeah, just copied it from other place.
>>
>>> +return -EINVAL;
>>> +}
>>> +
>>> +spin_unlock(>nl_cmd_lock);
>> Need spin_lock() instead of unlock.
>>
>>
>> I think before calling the code below you need to check if a nl command
>> is even waiting. If you just run this with no nl commands waiting then
>> next time we send a command nl_cmd->complete will be true and
>> tcmu_wait_genl_cmd_reply's wait_event call will return right away.
> Will fix this.
> 
>>
>>> +nl_cmd->complete = true;
>>> +nl_cmd->status = -EINTR;
>>> +wake_up(>complete_wq);
>> Instead of the code above, I think you need to take some of the guts out
>> of tcmu_genl_cmd_done to handle removal which is an odd cases due to the
>> refcoutning/locking and configfs use.
>>
>> For non removal commands we need to do a target_undepend_item. For
>> remove, we don't want to since we came through configfs and teardown
>> already started.
>>
>> So instead of the above lines take:
>>
>>  if (nl_cmd->cmd != completed_cmd) {
>>  printk(KERN_ERR "Mismatched commands (Expecting reply
>> for %d. Current %d).\n",
>> completed_cmd, nl_cmd->cmd);
>>  ret = -EINVAL;
>>  } else {
>>  nl_cmd->status = rc;
>>  }
>>
>> // Note: I changed this from the is_removed
>>  if (completed_cmd != TCMU_CMD_REMOVED_DEVICE)
>>   target_undepend_item(>dev_group.cg_item);
>>  if (!ret) {
>>  nl_cmd->complete = true;
>>  wake_up(>complete_wq);
>>  }
>>
> Sorry, this part I couldn't totally understand.
> 
> Do you mean take the code above you pasted by introducing
> target_undepend_item() for none removal paths just like in
> tcmu_genl_cmd_done() does?
> While in tcmu_genl_cmd_done(), the target_depend_item() is called by
> target_find_device() and then it should do target_undepend_item() after
> that, but in
> reset_netlink_store(), should it need target_undepend_item() ? If so
> they won't be in pairs.
> 

Yes, you are right. I thought we took a count when we sent the nl cmd.

> I think we should only take this part:
> 
> if (nl_cmd->cmd != completed_cmd) {
> printk(KERN_ERR "Mismatched commands (Expecting reply
> for %d. Current %d).\n",
>completed_cmd, nl_cmd->cmd);
> ret = -EINVAL;
> } else {
> nl_cmd->status = rc;
> }
> 
> from the tcmu_genl_cmd_done(), right ?
> 

We want the complete and wake up part too right?

>>  nl_cmd->complete = true;
>>  wake_up(>complete_wq);


> Thanks,
> BRs
> 
> 
>> from tcmu_genl_cmd_done and make a function that takes the status code
>> and completed_cmd. You would just then do
>>
>> __tcmu_genl_cmd_done(>curr_nl_cmd.cmd, -EINTR);
>>
>>
>>
>>
>>> +spin_unlock(>nl_cmd_lock);
>>> +
>>> +return count;
>>> +}
>>> +CONFIGFS_ATTR_WO(tcmu_, reset_netlink);
>>> +
>>>   static ssize_t tcmu_reset_ring_store(struct config_item *item,
>>> const char *page,
>>>size_t count)
>>>   {
>>> @@ -2363,6 +2400,7 @@ static ssize_t tcmu_reset_ring_store(struct
>>> config_item *item, const char *page,
>>>   static struct configfs_attribute *tcmu_action_attrs[] = {
>>>   _attr_block_dev,
>>>   _attr_reset_ring,
>>> +_attr_reset_netlink,
>>>   NULL,
>>>   };
>>>  
> 



Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-13 Thread Martin K. Petersen

Jinpu,

[CC:ed the mpt3sas maintainers]

The ratelimit patch is just an attempt to treat the symptom, not the
cause.

> Thanks for asking, we updated mpt3sas driver which enables DIX support
> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
> After reboot, kernel reports the IO errors from all the drives behind
> HBA, seems for almost every read IO, which turns the system unusable:
> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)

That sounds like a bug in the mpt3sas driver or firmware. I guess the
HBA could conceivably be operating a SATA device as DIX Type 0 and strip
the PI on the drive side. But that doesn't seem to be a particularly
useful mode of operation.

Jinpu: Which firmware are you running? Also, please send us the output
of:

sg_readcap -l /dev/sda
sg_inq -x /dev/sda
sg_vpd /dev/sda

Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
controller?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] target: Fix Fortify_panic kernel exception

2018-04-13 Thread Christoph Hellwig
The patch looks fine, but in general I think descriptions of what
you fixed in the code or more important than starting out with
a backtrace.

E.g. please explain what was wrong, how you fixed it and only after
that mention how it was caught.  (preferably without the whole trace)


Only boots with CONFIG_SCSI_MQ_DEFAULT=y

2018-04-13 Thread J William Piggott

Hello, please cc me I'm not subscribed.

Beginning with commit 64d513ac "scsi: use host wide tags by default" my
system will ONLY boot with CONFIG_SCSI_MQ_DEFAULT=y. Tested through
git-stable master (0adb328 v4.16).

It never gets far enough to log anything, but here are a few boot
messages that I copied by hand (so do not rely on their accuracy):
 

 Exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 Frozen

 ...

 ata2: SRST failed (errno = -16)
 ata2: reset failed, giving up
 ata2: nv: skipping hard reset on occupied port

 ...

 ata2.00: disabled
 ata2: EH complete
 
 
It doesn't always happen in the same place, but 95% of the time it
freezes late into run level 3 when the GTK immodule cache is updated.

Affected hardware messages (from a successful boot using SCSI_MQ_DEFAULT=y):
[0.184013] SCSI subsystem initialized
[0.184207] libata version 3.00 loaded.
[0.434160] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 
250)
[1.357484] sata_nv :00:0e.0: version 3.5
[1.357885] sata_nv :00:0e.0: Using SWNCQ mode
[1.358785] scsi host1: sata_nv
[1.359070] ata2: SATA max UDMA/133 cmd 0xe80 ctl 0xe00 bmdma 0xe008 irq 21
[4.848023] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[4.867282] ata2.00: ATA-7: SAMSUNG HD502HI, 1AG01118, max UDMA7
[4.867723] ata2.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[4.887272] ata2.00: configured for UDMA/133
[4.888439] scsi 1:0:0:0: Direct-Access ATA  SAMSUNG HD502HI  1118 
PQ: 0 ANSI: 5
[4.890456] sd 1:0:0:0: [sdf] 976773168 512-byte logical blocks: (500 GB/466 
GiB)
[4.890926] sd 1:0:0:0: [sdf] Write Protect is off
[4.891370] sd 1:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[4.893682] sd 1:0:0:0: [sdf] Write cache: enabled, read cache: enabled, 
doesn't support DPO or FUA
[4.896117]  sdf: sdf1 sdf2

Thanks,
 William


Re: [PATCH] scsi_transport_srp: Fix shost to rport translation

2018-04-13 Thread Bart Van Assche
On Fri, 2018-04-13 at 08:11 +0200, Hannes Reinecke wrote:
> On Thu, 12 Apr 2018 13:32:07 -0600
> "Bart Van Assche"  wrote:
> > +static int find_child_rport(struct device *dev, void *data)
> > +{
> > +   struct device **child = data;
> > +
> > +   if (scsi_is_srp_rport(dev)) {
> > +   WARN_ON_ONCE(*child);
> > +   *child = dev;
> > +   }
> > +   return 0;
> > +}
> > +
> 
> Why not have 'static struct device *find_child_rport() ?

Hello Hannes,

The function device_for_each_child() expects to be passed a int (*)(struct 
device *,
void *) pointer. Is there perhaps another function for iterating over device 
children
that accepts a function that returns a pointer?

Thanks,

Bart.





Re: MPT Fusion - ext4 delayed allocation failed errors on 4.14!

2018-04-13 Thread Nikola Ciprich

Hi Martin,

thanks for explanation. but disabling write same for now is safe, right?

to answer your questions:

drives are SATA.

root@siv-70140:~ # sg_inq /dev/sda
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x02  [SCSI-2]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=1  Sync=1  Linked=0  [TranDis=0]  CmdQue=1
length=57 (0x39), but only fetched 56 bytes   Peripheral device type: disk
 Vendor identification: LSILOGIC
 Product identification: Logical Volume  
 Product revision level: 3000


root@siv-70140:~ # sg_vpd /dev/sda
Supported VPD pages VPD page:
  Supported VPD pages [sv]
  Device identification [di]
  Supported VPD pages [sv]
  Supported VPD pages [sv]

BR

nik


On Thu, Apr 12, 2018 at 10:31:58PM -0400, Martin K. Petersen wrote:
> 
> Nikola,
> 
> > hmm, further checking 4.4.52, it was not really OK too:
> >
> > [ 1895.084325] blk_update_request: I/O error, dev sda, sector 239461608
> > [ 1895.101099] sd 4:1:1:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK
> > driverbyte=DRIVER_SENSE
> > [ 1895.101101] sd 4:1:1:0: [sda] tag#1 Sense Key : Illegal Request 
> > [current] 
> > [ 1895.101103] sd 4:1:1:0: [sda] tag#1 Add. Sense: No additional sense 
> > information
> 
> It's the "No additional sense information" that's the problem.
> 
> Under normal circumstances the drive would return "INVALID COMMAND
> OPCODE" or "INVALID FIELD IN CDB" in the additional sense data. That
> would cause us to disable WRITE SAME for the drive in question and
> report to the block layer that it should zero the blocks manually.
> 
> However, because we don't get any of the expected sense data back in
> your case, the failure manifests itself as a generic I/O failure to the
> stack. And that's why the error is reported up to ext4 instead of the
> block layer transparently doing fallback zeroing after WRITE SAME has
> been disabled.
> 
> What kind of drive is attached to your controller? SATA? SAS?
> 
> Please send us the output of:
> 
>sg_inq /dev/sda
> 
> and
> 
>sg_vpd /dev/sda
> 
> Thanks!
> 
> -- 
> Martin K. PetersenOracle Linux Engineering
> 

-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


Re: [PATCH] scsi_transport_srp: Fix shost to rport translation

2018-04-13 Thread Hannes Reinecke
On Thu, 12 Apr 2018 13:32:07 -0600
"Bart Van Assche"  wrote:

> Since an SRP remote port is attached as a child to shost->shost_gendev
> and as the only child, the translation from the shost pointer into an
> rport pointer must happen by looking up the shost child that is an
> rport. This patch fixes the following KASAN complaint:
> 
> BUG: KASAN: slab-out-of-bounds in srp_timed_out+0x57/0x110
> [scsi_transport_srp] Read of size 4 at addr 880035d3fcc0 by task
> kworker/1:0H/19
> 
> CPU: 1 PID: 19 Comm: kworker/1:0H Not tainted 4.16.0-rc3-dbg+ #1
> Workqueue: kblockd blk_mq_timeout_work
> Call Trace:
> dump_stack+0x85/0xc7
> print_address_description+0x65/0x270
> kasan_report+0x231/0x350
> srp_timed_out+0x57/0x110 [scsi_transport_srp]
> scsi_times_out+0xc7/0x3f0 [scsi_mod]
> blk_mq_terminate_expired+0xc2/0x140
> bt_iter+0xbc/0xd0
> blk_mq_queue_tag_busy_iter+0x1c7/0x350
> blk_mq_timeout_work+0x325/0x3f0
> process_one_work+0x441/0xa50
> worker_thread+0x76/0x6c0
> kthread+0x1b2/0x1d0
> ret_from_fork+0x24/0x30
> 
> Fixes: e68ca75200fe ("scsi_transport_srp: Reduce failover time")
> Signed-off-by: Bart Van Assche 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> Cc: Jason Gunthorpe 
> Cc: Doug Ledford 
> Cc: Laurence Oberman 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/scsi/scsi_transport_srp.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_transport_srp.c
> b/drivers/scsi/scsi_transport_srp.c index 36f6190931bc..0d0515615847
> 100644 --- a/drivers/scsi/scsi_transport_srp.c
> +++ b/drivers/scsi/scsi_transport_srp.c
> @@ -51,6 +51,8 @@ struct srp_internal {
>   struct transport_container rport_attr_cont;
>  };
>  
> +static int scsi_is_srp_rport(const struct device *dev);
> +
>  #define to_srp_internal(tmpl) container_of(tmpl, struct
> srp_internal, t) 
>  #define  dev_to_rport(d) container_of(d, struct
> srp_rport, dev) @@ -60,9 +62,24 @@ static inline struct Scsi_Host
> *rport_to_shost(struct srp_rport *r) return
> dev_to_shost(r->dev.parent); }
>  
> +static int find_child_rport(struct device *dev, void *data)
> +{
> + struct device **child = data;
> +
> + if (scsi_is_srp_rport(dev)) {
> + WARN_ON_ONCE(*child);
> + *child = dev;
> + }
> + return 0;
> +}
> +
Huh?

Why not have 'static struct device *find_child_rport() ?

Cheers,

Hannes